Data Talent Time Bomb 🧨, Data Activation Wins 🥇, ClickHouse Buys Langfuse 🤝


TLDR

TLDR Data 2026-01-19

📱

Deep Dives

Apache Hudi at Uber: Engineering for Trillion-Record-Scale Data Lake Operations (14 minute read)

Uber powers its massive transactional data lake with Apache Hudi, ingesting 6 trillion rows daily. Key innovations include the Metadata Table for O(1) key lookups, a Record Index that serves key lookups in 1-2 ms, HiveSync for multi-data-center replication, and 10x faster compaction from a row-group-level merging strategy, effectively solving the small-file issue in streaming workloads.
A Practical Approach to Replenishment Optimization with Extended (R, s, Q) Policy and Probabilistic Models (10 minute read)

Zalando combines probabilistic forecasting and discrete event simulation to bring inventory theory to operational scale. Its inventory optimization tool uses probabilistic LightGBM-based demand forecasting, an extended (R, s, Q) replenishment policy, and Monte Carlo-driven discrete event simulation to optimize inventory decisions under uncertainty. Tested on 2 million articles across 800 merchants, the system consistently outperforms traditional approaches, delivering superior cost control and minimizing both overstock and stock-outs.
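
As a rough illustration of the idea, here is a minimal sketch of a plain textbook (R, s, Q) policy evaluated by Monte Carlo simulation. It is not Zalando's extended policy or tooling. It assumes lost sales, a fixed lead time, and a uniform integer demand distribution standing in for a real probabilistic forecast. The function names `simulate_rsq` and `monte_carlo` and all parameter values are hypothetical.

```python
import random


def simulate_rsq(R, s, Q, lead_time, demand_sampler, horizon, start_stock, rng):
    """Simulate one demand path under a plain (R, s, Q) policy.

    Every R days the inventory position (on hand + on order) is reviewed;
    if it is at or below s, a fixed quantity Q is ordered and arrives
    lead_time days later. Unmet demand is treated as lost sales.
    Returns (units_short, average_on_hand).
    """
    if lead_time < 1:
        raise ValueError("lead_time must be at least 1 day in this sketch")
    on_hand = start_stock
    pipeline = {}  # arrival day -> units on order
    units_short = 0
    on_hand_total = 0
    for day in range(horizon):
        # Receive any order scheduled to arrive today.
        on_hand += pipeline.pop(day, 0)
        # Periodic review: only every R days.
        if day % R == 0:
            position = on_hand + sum(pipeline.values())
            if position <= s:
                arrival = day + lead_time
                pipeline[arrival] = pipeline.get(arrival, 0) + Q
        # Serve demand from stock; anything unmet is lost.
        demand = demand_sampler(rng)
        sold = min(on_hand, demand)
        units_short += demand - sold
        on_hand -= sold
        on_hand_total += on_hand
    return units_short, on_hand_total / horizon


def monte_carlo(R, s, Q, runs=200, seed=0):
    """Average shortage and stock level over many sampled demand paths."""
    rng = random.Random(seed)
    # Assumption: uniform daily demand in [0, 10] replaces a real forecast.
    sampler = lambda r: r.randint(0, 10)
    results = [
        simulate_rsq(R, s, Q, lead_time=3, demand_sampler=sampler,
                     horizon=90, start_stock=40, rng=rng)
        for _ in range(runs)
    ]
    avg_short = sum(r[0] for r in results) / runs
    avg_stock = sum(r[1] for r in results) / runs
    return avg_short, avg_stock
```

Comparing `monte_carlo(7, 20, 50)` against other candidate values of s and Q gives a simple view of the trade-off between stock-outs and average stock held.
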
Graph neural networks at Faire (8 minute read)

Faire replaced traditional FM-based product recommendations with a graph neural network (GNN) built on a bipartite engagement graph weighted by interaction type and recency. This GNN, using a GAT-based two-tower architecture and time-decayed edge weighting, improved recall@10 by 25.8% offline and drove a 4.85% lift in orders in live A/B testing. Its ability to generalize from sparse direct engagement signals and maintain stable, up-to-date embeddings makes this model a great fit for large-scale, dynamic marketplaces.
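
To make "weighted by interaction type and recency" concrete, here is a minimal sketch of one way such edge weights could be computed. It is not Faire's implementation. The interaction weights, the 30-day half-life, the exponential decay form, the retailer-to-product edge layout, and the name `build_weighted_edges` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical base weight per interaction type (not from the article).
INTERACTION_WEIGHT = {"view": 1.0, "add_to_cart": 3.0, "order": 5.0}


def build_weighted_edges(log, now_day, half_life_days=30.0):
    """Aggregate an engagement log into weighted bipartite edges.

    log is an iterable of (retailer, product, interaction_type, day).
    Each event contributes its base weight, halved for every
    half_life_days that have passed since the event.
    Returns a dict mapping (retailer, product) to the edge weight.
    """
    edges = defaultdict(float)
    for retailer, product, kind, day in log:
        age = now_day - day
        if age < 0:
            raise ValueError("event is dated after now_day")
        decay = 0.5 ** (age / half_life_days)
        edges[(retailer, product)] += INTERACTION_WEIGHT[kind] * decay
    return dict(edges)
```

With this scheme, a 30-day-old order contributes 2.5 to its edge while a view from today contributes 1.0, so recent weak signals and older strong signals both stay visible to the model.
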
🚀

Opinions & Advice

The 2026 Data Engineering Strategy Nobody's Writing (But Everyone Needs) (11 minute read)

The real risk to data platforms in 2026 is a talent crisis, with too few junior roles, high burnout, and AI unable to replace human judgment. Teams also overbuild distributed systems for small workloads, where simpler tools could cut costs by up to 90 percent. The fix is to invest in junior talent, mentorship, and retention to avoid future workforce gaps.
Data Activation Thoughts (5 minute read)

Traditional data moats are fading. The real advantage now comes from data activation: how well you transform proprietary data into structured forms that LLMs can actually use to improve performance. Healthcare shows this clearly: research indicates that structured reasoning scaffolds can unlock major gains from EHR data, but the right transformation methods are still an open problem.
Stop Using MySQL in 2026, It Is Not True Open Source (9 minute read)

MySQL's community health and technical quality are declining under Oracle's stewardship, which weakens its credibility as a truly open-source project and increases long-term risk for users. Switching to MariaDB or other open databases is now easy, and it is the safer, more future-proof choice.
Realist's Guide to Hybrid Mesh Architecture (1): Single Source of Truth vs Democratisation (8 minute read)

Hybrid mesh architectures try to balance team autonomy with a single source of truth, but in practice, they often struggle and end up relying on centralized governance anyway. The "Constellation Architecture" keeps teams independent while centralizing the integration and shared data layers, reducing complexity and making ownership and rules clearer.
💻

Launches & Tools

Alternatives to MinIO for single-node local S3 (8 minute read)

MinIO was abandoned in late 2025, sparking a search for reliable open-source S3-compatible alternatives for single-node local use. For demo and lightweight scenarios, SeaweedFS and S3Proxy stand out for their simplicity, maturity, ease of setup, and strong, active community governance.
Apache Arrow ADBC Database Drivers (3 minute read)

Apache Arrow ADBC (Arrow Database Connectivity) is a modern, high-performance alternative to legacy ODBC/JDBC drivers that enables direct end-to-end movement of Arrow RecordBatches between applications and databases without row-by-row marshaling, copies, or heavy serialization overhead.
Arroyo – Distributed Stream Processing Engine in Rust (GitHub Repo)

Arroyo is a cloud-native stream processing engine written in Rust that delivers low-latency, exactly-once streaming with a SQL-first interface. It aims to be a simpler, more developer-friendly alternative to Flink by combining high performance (no JVM), strong correctness guarantees, and easy deployment for real-time analytics workloads.
🎁

Miscellaneous

ClickHouse welcomes Langfuse: The future of open-source LLM observability (8 minute read)

ClickHouse acquired Langfuse to pair LLM observability and prompt management with high-performance analytics, strengthening its open-source stack for building production AI apps.
The Real Reason IBM Acquired Confluent for $11bn (11 minute read)

IBM's $11B acquisition of Confluent signals that real-time context is becoming core to enterprise AI, with Kafka positioned as foundational infrastructure for AI agents. Confluent was growing strongly yet valued well below its peers, making the deal strategically and financially compelling. The move confirms that context engineering is now a key battleground, with rivals like Redpanda and Akka racing to compete.

Quick Links

PostgreSQL Table Design Skill for Agents (GitHub Repo)

A concise reference for designing PostgreSQL tables that explains which data types, constraints, and indexes to use to keep schemas correct, fast, and maintainable.
SoccerData (GitHub Repo)

SoccerData is a Python package that provides scrapers for multiple soccer data sources.


Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

