Pinterest’s Streaming Resilience 🛡️, Why Analytics Agents Break ⚙️, From Presto to Pinot MSE 🍷


Together With AWS

TLDR Data 2025-10-23

Data architecture tools and guides for your next AI project (Sponsor)

Accelerate your AI/ML initiatives with enterprise-ready solutions in AWS Marketplace. From vector databases to ML workflow orchestration, explore our tools today to scale your AI applications while maintaining security standards.

Discover our technical guides to streamline implementation, or start your free trial to see how these solutions can transform your AI development journey from proof-of-concept to production.

📱 Deep Dives

How Pinterest Transfers Hundreds of Terabytes of Data With CDC (5 minute read)

Pinterest transfers hundreds of terabytes daily from numerous sharded MySQL sources to analytical systems using Kafka Connect and Debezium. By separating configuration and streaming logic, they ensure safe connector updates, reroute partitions, and maintain exactly-once semantics under heavy load. They prevent data duplication during recovery with metadata synchronization and idempotent replays, and enforce schema evolution rules.
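
The exactly-once behaviour described above rests on idempotent writes keyed by source metadata. Below is a minimal Python sketch of that idea, assuming a sink keyed by Debezium-style binlog coordinates (server_id, file, pos); the field names and in-memory "sink" are illustrative, not Pinterest's actual schema or storage.

```python
# Illustrative only: dedupe CDC events on replay using Debezium-style
# source metadata (binlog file + position) as an idempotency key.
applied_positions = {}   # stands in for sink-side replay metadata
sink = {}                # stands in for the analytical table

def apply_event(event: dict) -> bool:
    """Apply a change event unless its binlog position was already applied."""
    src = event["source"]                        # Debezium puts binlog coordinates here
    key = (src["server_id"], src["file"], src["pos"])
    if key in applied_positions:
        return False                             # duplicate from a replayed range: skip
    sink[event["after"]["id"]] = event["after"]  # upsert keeps replays idempotent
    applied_positions[key] = True                # persist atomically with the write in practice
    return True

# A crash/recovery replay re-delivers the same event; the second apply is a no-op.
evt = {"source": {"server_id": 1, "file": "binlog.000042", "pos": 154},
       "after": {"id": 7, "board_count": 12}}
assert apply_event(evt) is True
assert apply_event(evt) is False
```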
Rebuilding Uber's Apache Pinot Query Architecture (10 minute read)

Uber transitioned from a complex, layered Presto-based (Neutrino) architecture to a simplified, brokerless design using Pinot's new Multi-Stage Engine (MSE) Lite Mode to enhance reliability, reduce latency, and support complex OLAP queries at scale. This enhancement powers hundreds of millions of daily low-latency queries for user analytics, log search, and tracing.
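
For readers who want to poke at Pinot's SQL interface themselves, the sketch below issues an OLAP-style query through the pinotdb DB-API client. The broker host, table, and columns are placeholders (not Uber's deployment), and whether the query runs on the multi-stage engine depends on cluster configuration and query options not shown here.

```python
# Sketch: querying a Pinot broker with the pinotdb client (pip install pinotdb).
# Host, table, and column names are placeholders.
from pinotdb import connect

conn = connect(host="pinot-broker.example.com", port=8099,
               path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("""
    SELECT city, COUNT(*) AS trips
    FROM rides
    WHERE ts > ago('PT1H')
    GROUP BY city
    ORDER BY trips DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```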
Behind the Streams: Real-Time Recommendations for Live Events Part 3 (8 minute read)

Netflix deployed a two-phase approach to deliver real-time recommendations during live events like NFL games, where recommendations must align precisely with event timing to avoid missed moments while handling massive concurrent demand across 100M+ devices. The first phase prefetches recommendations via GraphQL through the Domain Graph Service during routine browsing to spread load evenly; the second broadcasts low-cardinality messages (state keys and timestamps) via WebSockets to all devices at critical moments during the event.
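
The second phase boils down to fanning out a tiny payload to every connected device, which each device resolves against content it prefetched earlier. Here is a minimal sketch using Python's websockets library; the message schema, state key, and port are assumptions for illustration, not Netflix's actual stack.

```python
# Sketch of phase two: broadcast a low-cardinality message (state key + timestamp)
# to all connected clients. Uses the `websockets` library (pip install websockets).
# The message schema and endpoint are illustrative assumptions.
import asyncio, json, time
import websockets

connected = set()

async def handler(ws):
    connected.add(ws)            # register each device connection
    try:
        await ws.wait_closed()
    finally:
        connected.discard(ws)

async def push_state(state_key: str):
    # Same small payload for everyone; devices resolve it against prefetched data.
    msg = json.dumps({"state": state_key, "ts": int(time.time() * 1000)})
    websockets.broadcast(connected, msg)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await push_state("halftime_show_start")
        await asyncio.Future()   # keep the server running

asyncio.run(main())
```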
Mastering RAG: How To Architect An Enterprise RAG System (94 minute read)

Building a robust enterprise RAG system requires a modular architecture with strong authentication, input/output guardrails, advanced query rewriting, rigorous custom encoder selection, scalable document ingestion, and careful vector database choices. Implement reranking, hybrid search, and diverse chunking methods to reduce irrelevant or incomplete answers. Continuous LLM observability, user feedback loops, and caching drive system reliability, efficiency, and cost control.
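
As one concrete illustration of the hybrid-search idea, the sketch below fuses a lexical ranking and a vector ranking with reciprocal rank fusion. This is a generic pattern rather than the article's specific pipeline, and the two input lists stand in for results from a real BM25 index and a vector database.

```python
# Generic hybrid-search sketch: combine lexical and vector rankings with
# reciprocal rank fusion (RRF). The two input rankings are toy stand-ins.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical_hits = ["doc3", "doc1", "doc7"]   # e.g. from a BM25 index
vector_hits  = ["doc1", "doc9", "doc3"]   # e.g. from a vector database
print(rrf([lexical_hits, vector_hits]))   # doc1 and doc3 float to the top
```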
🚀 Opinions & Advice

Why Analytics Agents Break Differently (5 minute read)

Analytics agents fail differently from coding ones because data can't be summarized without losing meaning. Hex learned that overloading models with raw data breaks reasoning, so it built structured context maps, set strict token limits, and made truncation explicit to help agents navigate and analyze data more effectively.
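
A rough sketch of the "structured context map with explicit truncation" idea is shown below; the shape of the map and the token-counting heuristic are assumptions for illustration, not Hex's implementation.

```python
# Sketch: give the agent a structured map of a dataset instead of raw rows,
# enforce a token budget, and mark truncation explicitly so the model knows
# what it is NOT seeing. Token counting here is a crude word-count stand-in.
def describe_columns(columns: list[dict], token_budget: int = 200) -> str:
    lines, used = [], 0
    for col in columns:
        line = f"- {col['name']} ({col['dtype']}): {col['n_unique']} distinct, e.g. {col['sample']}"
        cost = len(line.split())
        if used + cost > token_budget:
            lines.append(f"[truncated: {len(columns) - len(lines)} more columns omitted]")
            break
        lines.append(line)
        used += cost
    return "\n".join(lines)

cols = [{"name": f"col_{i}", "dtype": "float64", "n_unique": 1000 + i, "sample": 3.14}
        for i in range(50)]
print(describe_columns(cols, token_budget=60))
```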
Build Your Own Database (10 minute read)

Building a key-value database from scratch reveals how simple file storage evolves into scalable systems. Append-only files with compaction improve durability, while indexing and sorting boost read speed at the cost of slower writes. These ideas form the basis of Log-Structured Merge Trees, used in databases like LevelDB and DynamoDB.
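
In the spirit of the article, here is a tiny append-only key-value store with an in-memory index over byte offsets; compaction and sorted segments, the steps that lead toward an LSM tree, are left out to keep the sketch short.

```python
# Minimal append-only key-value store: writes append to a log file,
# an in-memory hash index maps each key to its latest byte offset.
import os

class TinyKV:
    def __init__(self, path: str):
        self.path, self.index = path, {}
        open(path, "a").close()              # ensure the log file exists
        self._rebuild_index()

    def _rebuild_index(self):
        with open(self.path, "rb") as f:
            offset = 0
            for line in f:
                key = line.split(b"\t", 1)[0].decode()
                self.index[key] = offset     # later entries win
                offset += len(line)

    def put(self, key: str, value: str):
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # byte offset where this record starts
            f.write(f"{key}\t{value}\n".encode())
            self.index[key] = offset

    def get(self, key: str):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])
            return f.readline().split(b"\t", 1)[1].decode().rstrip("\n")

db = TinyKV("tiny.log")
db.put("user:1", "alice"); db.put("user:1", "alice_v2")
print(db.get("user:1"))    # alice_v2, last write wins
os.remove("tiny.log")
```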
Is RAG Dead? The Rise of Context Engineering and Semantic Layers for Agentic AI (18 minute read)

RAG was only the starting point of the Context Engineering discipline. Modern context engineering now incorporates context writing, compression, isolation, and selection, demanding robust metadata management, policy-as-code guardrails, and multimodal capabilities. Knowledge graphs underpin explainable, trustworthy, and scalable AI, while new evaluation metrics (relevance, groundedness, provenance, recency) anchor enterprise-grade solutions.
The Death of Thread Per Core (3 minute read)

Async runtimes like those in Rust are shifting from rigid thread-per-core models toward dynamic, work-stealing approaches that balance workloads better, especially under skewed or unpredictable data distributions. Modern query engines benefit from granular scheduling, using what they can predict about each task to merge or split work efficiently. This shared-state concurrency improves elasticity and better addresses scaling and multitenancy challenges, making dynamic task reshuffling integral to resilient, high-performance data processing platforms.
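
To make the scheduling argument concrete, the toy simulation below compares a static per-core split of work against a shared queue that the next free worker always pulls from, under skewed task costs. It illustrates the principle only, not any particular runtime.

```python
# Toy comparison: static thread-per-core partitioning vs. a shared
# (work-stealing style) queue, measured as simulated makespan under skew.
import heapq

tasks = [100] * 25 + [1] * 75           # skewed: the expensive work is clustered
n_workers = 4

# Static partitioning: contiguous chunks pre-assigned to workers.
chunk = len(tasks) // n_workers
static_load = [sum(tasks[i * chunk:(i + 1) * chunk]) for i in range(n_workers)]
static_makespan = max(static_load)

# Shared queue: the next free worker always takes the next task.
finish_times = [0.0] * n_workers        # each worker's current finish time
heapq.heapify(finish_times)
for cost in tasks:
    earliest = heapq.heappop(finish_times)   # earliest-free worker grabs the task
    heapq.heappush(finish_times, earliest + cost)
shared_makespan = max(finish_times)

print(f"static partitioning makespan: {static_makespan}")   # one worker gets all the big tasks
print(f"shared-queue makespan:        {shared_makespan}")   # load evens out dynamically
```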
💻 Launches & Tools

Building with Google Cloud databases: a collection of 70+ case studies (Sponsor)

Learn how 70+ companies improved performance, scaled globally, and optimized costs using Google Cloud's managed database services: AlloyDB, Cloud SQL, Spanner, Memorystore, Bigtable, and Firestore. Each case study is a one-pager that distills the key insights from deployments at companies like Macy's, Wayfair, Yahoo, and many others.
Get the resource
DuckDB Tera Extension (Tool)

Query.Farm's Tera extension adds template rendering directly into DuckDB, allowing SQL queries to dynamically generate text, HTML, JSON, or configuration files using the Tera templating engine. It lets you embed variables, loops, and conditions inside templates to produce formatted reports, API responses, or configuration outputs directly from database data without leaving SQL.
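
If you want to try it, the general shape would look something like the sketch below from DuckDB's Python client. The install/load statements follow DuckDB's community-extension convention, but the tera_render function name and signature are assumptions made here for illustration; check the extension's documentation for the actual API.

```python
# Rough sketch of trying the extension from DuckDB's Python client.
# NOTE: tera_render(...) is an ASSUMED function name/signature for illustration;
# consult the Query.Farm Tera extension docs for the real API.
import duckdb

con = duckdb.connect()
con.execute("INSTALL tera FROM community")
con.execute("LOAD tera")

# Render a template per row, feeding row values in as template context (assumed API).
con.sql("""
    SELECT tera_render(
        'Hello {{ name }}, you have {{ pins }} pins.',
        {'name': name, 'pins': pins}
    ) AS greeting
    FROM (VALUES ('Ada', 12), ('Grace', 7)) AS t(name, pins)
""").show()
```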
IndexTables for Spark (GitHub Repo)

Built by the IndexTables project, this Apache-2.0-licensed library adds a high-performance, full-text-search-capable open table format integrated with Spark SQL. It enables SQL queries with full text search and fast retrieval across large-scale data sets. Still experimental and less mature than mainstream formats, it may offer value when search-style queries dominate and Hadoop/Spark is already in use.
Databases Without an OS? Meet QuinineHM (11 minute read)

QuinineHM is a "Hardware Manager" that replaces the operating system to run databases directly on bare metal. This removes context-switch and scheduler overhead, exposing CPUs, memory, and NICs directly to workloads for deterministic speed and near-zero attack surface. Its first product, TonicDB, a Redis-compatible in-memory DB, runs up to 20x faster and 3x cheaper.
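
Because TonicDB advertises Redis compatibility, existing Redis clients should be able to talk to it unchanged; the snippet below uses redis-py with a placeholder host and port.

```python
# Since TonicDB is described as Redis-compatible, a stock Redis client should
# work against it directly. Host and port here are placeholders, not a real endpoint.
import redis

r = redis.Redis(host="tonicdb.example.internal", port=6379, decode_responses=True)
r.set("session:42", "active", ex=300)   # standard Redis commands, 5-minute TTL
print(r.get("session:42"))              # -> "active"
```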
🎁 Miscellaneous

Identify User Journeys at Pinterest (8 minute read)

By modeling user journeys as hierarchical clusters of activities (searches, Pins, boards), Pinterest shifts from short-term interests to personalized, intent-driven recommendations, using lean, foundation-model-based techniques to work around the limited training data available for new journey-focused products. The pipeline extracts keywords from activities, embeds and clusters them, ranks, names, and expands journeys, predicts their stage (situational vs. evergreen), and outputs scored lists.
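
The clustering step can be pictured with an off-the-shelf sketch like the one below, which groups activity keywords using TF-IDF vectors and k-means. Pinterest's pipeline uses foundation-model embeddings and a much richer hierarchy, so treat this purely as an illustration of that one step; the keyword list is made up.

```python
# Illustration of the "cluster activity keywords" step using scikit-learn.
# Pinterest uses foundation-model embeddings; TF-IDF + k-means is a stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

keywords = ["small kitchen remodel", "kitchen cabinet paint", "backyard wedding decor",
            "wedding table centerpieces", "farmhouse kitchen ideas", "wedding invitations diy"]

vectors = TfidfVectorizer().fit_transform(keywords)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for cluster in sorted(set(labels)):
    members = [kw for kw, lbl in zip(keywords, labels) if lbl == cluster]
    print(f"journey candidate {cluster}: {members}")
```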
Apache Flink Watermarks…WTF? (Website)

This interactive website visually illustrates how Flink uses watermarks to manage event-time in streams and establish when it's safe to treat earlier timestamps as complete and trigger windows. Key takeaways: generate timestamps early, use a strategy tailored to your data's out-of-order characteristics, and remember that in multi-input operators the watermark advances only as fast as the slowest upstream source, so skew or idle partitions can bottleneck your pipeline.
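
The "slowest upstream source wins" rule can be captured in a few lines: each source's watermark trails the maximum event time it has seen by the allowed out-of-orderness, and a multi-input operator's watermark is the minimum across its inputs. The sketch below is a plain-Python illustration of that rule, not Flink API code.

```python
# Plain-Python illustration of watermark propagation: each source's watermark is
# (max event time seen - allowed out-of-orderness); a multi-input operator's
# watermark is the MINIMUM across inputs, so one lagging source holds everyone back.
OUT_OF_ORDERNESS_MS = 5_000

def source_watermark(event_times_ms: list[int]) -> int:
    return max(event_times_ms) - OUT_OF_ORDERNESS_MS

fast_source = [10_000, 20_000, 30_000]   # event times (ms) seen so far
slow_source = [10_000, 12_000]           # lagging or skewed partition

operator_watermark = min(source_watermark(fast_source), source_watermark(slow_source))
print(operator_watermark)  # 7000: only windows up to t=7s may close, despite the fast source at t=30s
```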

Quick Links

Postgres for Agents (5 minute read)

Agentic Postgres is a version of PostgreSQL built for AI agents.
Modernising Grab's Model Serving Platform with NVIDIA Triton Inference Server (6 minute read)

Migrating to Triton cut p90 latency 6x and reduced infrastructure costs by ~20% across half of Grab's online ML deployments.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.
