How Pinterest Transfers Hundreds of Terabytes of Data With CDC (5 minute read)
Pinterest transfers hundreds of terabytes daily from numerous sharded MySQL sources to analytical systems using Kafka Connect and Debezium. By separating configuration from streaming logic, the team can update connectors safely, reroute partitions, and maintain exactly-once semantics under heavy load. Metadata synchronization and idempotent replays prevent data duplication during recovery, and schema evolution rules are enforced throughout (see the connector sketch below).

Rebuilding Uber's Apache Pinot Query Architecture (10 minute read)
Uber transitioned from a complex, layered Presto-based (Neutrino) architecture to a simplified, brokerless design using Pinot's new Multi-Stage Engine (MSE) Lite Mode to improve reliability, reduce latency, and support complex OLAP queries at scale. The new architecture powers hundreds of millions of low-latency queries daily for user analytics, log search, and tracing.

Behind the Streams: Real-Time Recommendations for Live Events Part 3 (8 minute read)
Netflix deployed a two-phase approach to deliver real-time recommendations during live events like NFL games, where recommendations must align precisely with event timing to avoid missed moments while handling massive concurrent demand across 100M+ devices. Phase one prefetches recommendations via GraphQL to the Domain Graph Service during routine browsing to spread load evenly; phase two broadcasts low-cardinality messages (state keys and timestamps) via WebSockets to all devices at critical moments of the event (see the broadcast sketch below).

Mastering RAG: How To Architect An Enterprise RAG System (94 minute read)
Building a robust enterprise RAG system requires a modular architecture with strong authentication, input/output guardrails, advanced query rewriting, rigorous custom encoder selection, scalable document ingestion, and careful vector database choices. Implement reranking, hybrid search, and diverse chunking methods to reduce irrelevant or incomplete answers (see the rank-fusion sketch below). Continuous LLM observability, user feedback loops, and caching drive reliability, efficiency, and cost control.

Why Analytics Agents Break Differently (5 minute read)
Analytics agents fail differently from coding agents because data can't be summarized without losing meaning. Hex learned that overloading models with raw data breaks reasoning, so it built structured context maps, set strict token limits, and made truncation explicit to help agents navigate and analyze data more effectively (see the context-map sketch below).

Build Your Own Database (10 minute read)
Building a key-value database from scratch reveals how simple file storage evolves into scalable systems. Append-only files with compaction improve durability, while indexing and sorting boost read speed at the cost of slower writes (see the append-only store sketch below). These ideas form the basis of Log-Structured Merge Trees, used in databases like LevelDB and DynamoDB.

Is RAG Dead? The Rise of Context Engineering and Semantic Layers for Agentic AI (18 minute read)
RAG was only the starting point of the context engineering discipline. Modern context engineering incorporates context writing, compression, isolation, and selection, demanding robust metadata management, policy-as-code guardrails, and multimodal capabilities. Knowledge graphs underpin explainable, trustworthy, and scalable AI, while new evaluation metrics (relevance, groundedness, provenance, recency) define enterprise-grade solutions.
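For the Pinterest CDC write-up: a setup like the one described is typically wired together by registering a Debezium MySQL connector with Kafka Connect's REST API. A minimal Python sketch follows; the connector name, hosts, credentials, and table list are placeholder assumptions, not Pinterest's configuration.

```python
# Registering a Debezium MySQL source connector with Kafka Connect.
# All hostnames, credentials, and table names below are placeholders.
import requests

connector = {
    "name": "orders-shard-00-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-shard-00.internal",
        "database.port": "3306",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.server.id": "184054",    # must be unique per connector
        "topic.prefix": "shard00",         # prefix for change-event topics
        "table.include.list": "orders.orders,orders.order_items",
        "snapshot.mode": "initial",        # full snapshot, then stream the binlog
    },
}

# Kafka Connect exposes connector management over REST, by default on 8083.
resp = requests.post(
    "http://kafka-connect.internal:8083/connectors",
    json=connector,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["name"], "registered")
```

Keeping configuration like this outside the streaming path is what makes it possible to update or reroute a connector without touching the replication logic itself.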
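For the Netflix piece: phase two works because the broadcast payload is deliberately tiny. Here is a sketch of that pattern using Python's websockets library; the message schema and state key are assumptions for illustration, not Netflix's protocol.

```python
# Broadcasting a low-cardinality message (state key + timestamp) to all
# connected clients; devices resolve the key against recommendations
# they prefetched earlier during routine browsing.
import asyncio
import json
import time

import websockets

CONNECTED = set()

async def handler(ws):
    # Track each device connection for the lifetime of its socket.
    CONNECTED.add(ws)
    try:
        await ws.wait_closed()
    finally:
        CONNECTED.discard(ws)

def broadcast(state_key: str) -> None:
    # The same small frame fans out to every device.
    payload = json.dumps({"stateKey": state_key, "ts": int(time.time() * 1000)})
    websockets.broadcast(CONNECTED, payload)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        # In production a live-event control plane would trigger this at
        # the critical moment; here we fire one event after startup.
        await asyncio.sleep(5)
        broadcast("kickoff")  # hypothetical state key
        await asyncio.Future()  # keep serving

asyncio.run(main())
```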
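One concrete way to realize the hybrid search the RAG guide recommends is reciprocal rank fusion (RRF) over keyword and vector results, with a reranker applied afterwards. A self-contained sketch with stubbed retriever outputs (the doc ids are made up):

```python
# Reciprocal rank fusion: combine several ranked lists of doc ids into
# one ranking without needing comparable scores across retrievers.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical retriever outputs for the same query:
bm25_hits = ["doc7", "doc2", "doc9", "doc4"]    # keyword search
vector_hits = ["doc2", "doc7", "doc5", "doc9"]  # embedding search

candidates = rrf([bm25_hits, vector_hits])
# A cross-encoder reranker would then rescore the top candidates
# before they are passed to the LLM as context.
print(candidates[:3])
```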
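For the Hex piece: the sketch below illustrates the general idea of a structured context map with explicit truncation. It is a loose interpretation of what the article describes, not Hex's implementation; the schema, token heuristic, and names are assumptions.

```python
# Describe a table to an agent instead of dumping raw rows, and mark
# exactly what was cut so the agent knows its view is partial.
def context_map(table: str, columns: dict[str, str],
                sample_rows: list[dict], token_budget: int = 500) -> str:
    lines = [f"table: {table}", "columns:"]
    lines += [f"  - {name}: {dtype}" for name, dtype in columns.items()]
    lines.append("sample_rows:")
    shown = 0
    for row in sample_rows:
        rendered = f"  {row}"
        # Crude token proxy: ~4 characters per token.
        if (sum(len(l) for l in lines) + len(rendered)) / 4 > token_budget:
            break
        lines.append(rendered)
        shown += 1
    if shown < len(sample_rows):
        # Truncation is explicit, never silent.
        lines.append(f"  ... truncated: {len(sample_rows) - shown} more rows")
    return "\n".join(lines)

print(context_map(
    "orders",
    {"order_id": "int", "amount_usd": "float", "created_at": "timestamp"},
    [{"order_id": i, "amount_usd": 10.0 * i, "created_at": "2024-01-01"}
     for i in range(200)],
))
```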
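For Build Your Own Database: a minimal sketch of the first step, an append-only log with an in-memory index mapping each key to the byte offset of its latest record (keys and values are assumed to be free of tabs and newlines). Compaction would rewrite the file keeping only the newest record per key; sorting runs of records is the step toward an LSM tree.

```python
# Bitcask-style key-value store: writes append to a log, reads seek to
# the offset recorded in an in-memory index.
import os

class AppendOnlyKV:
    def __init__(self, path: str):
        self.path = path
        self.index: dict[str, int] = {}  # key -> offset of latest record
        if os.path.exists(path):
            self._rebuild_index()
        else:
            open(path, "a").close()

    def _rebuild_index(self) -> None:
        # Replay the log; later records overwrite earlier index entries.
        with open(self.path, "rb") as f:
            offset = 0
            for line in f:
                key, _, _ = line.decode().partition("\t")
                self.index[key] = offset
                offset += len(line)

    def put(self, key: str, value: str) -> None:
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(f"{key}\t{value}\n".encode())
        self.index[key] = offset  # newest write wins

    def get(self, key: str) -> str | None:
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            _, _, value = f.readline().decode().rstrip("\n").partition("\t")
            return value

db = AppendOnlyKV("/tmp/kv.log")
db.put("user:1", "alice")
db.put("user:1", "bob")  # stale record stays on disk until compaction
print(db.get("user:1"))  # -> "bob"
```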
The Death of Thread Per Core (3 minute read)
Async runtimes like those in Rust are shifting from rigid thread-per-core models toward dynamic, work-stealing approaches, enabling better workload balancing, especially under skewed or unpredictable data distributions (see the work-stealing sketch below). Modern query engines benefit from granular scheduling, using predictive insights about tasks to merge or split workloads efficiently. This shared-state concurrency improves elasticity and better addresses scaling and multitenancy challenges, making dynamic task reshuffling integral to resilient, high-performance data processing platforms.

Building with Google Cloud databases: a collection of 70+ case studies (Sponsor)
Learn how 70+ companies improved performance, scaled globally, and optimized costs using Google Cloud's managed database services: AlloyDB, Cloud SQL, Spanner, Memorystore, Bigtable, and Firestore. Each case study is a one-pager that distills the key insights from deployments at companies like Macy's, Wayfair, Yahoo, and many others. Get the resource.

DuckDB Tera Extension (Tool)
Query.Farm's Tera extension adds template rendering directly into DuckDB, allowing SQL queries to dynamically generate text, HTML, JSON, or configuration files using the Tera templating engine. It lets you embed variables, loops, and conditions inside templates to produce formatted reports, API responses, or configuration outputs directly from database data without leaving SQL.

IndexTables for Spark (GitHub Repo)
Built by the IndexTables project, this Apache-2.0-licensed library adds a high-performance open table format with full-text search, integrated with Spark SQL. It enables SQL queries with full-text search and fast retrieval across large-scale datasets. Still experimental and less mature than mainstream formats, it may offer value when search-style queries dominate and Hadoop/Spark is already in use.

Databases Without an OS? Meet QuinineHM (11 minute read)
QuinineHM is a "Hardware Manager" that replaces the operating system to run databases directly on bare metal. This removes context-switch and scheduler overhead, exposing CPUs, memory, and NICs directly to workloads for deterministic speed and a near-zero attack surface. Its first product, TonicDB, a Redis-compatible in-memory database, runs up to 20x faster and 3x cheaper (see the client sketch below).

Identify User Journeys at Pinterest (8 minute read)
By modeling user journeys as hierarchical clusters of activities (searches, Pins, boards), Pinterest shifts from short-term interests to personalized, intent-driven recommendations, using lean, foundation-model-based techniques to work around limited training data for new journey-focused products. The pipeline extracts keywords from activities, clusters and embeds them, ranks, names, and expands journeys, predicts their stage (situational/evergreen), and outputs scored lists.

Apache Flink Watermarks…WTF? (Website)
This interactive website visually illustrates how Flink uses watermarks to manage event time in streams and establish when it is safe to treat earlier timestamps as complete and trigger windows. Key takeaways: generate timestamps early, use a strategy tailored to your data's out-of-order characteristics, and remember that in multi-input operators the watermark advances only as fast as the slowest upstream source, so skew or idle partitions can bottleneck your pipeline (see the watermark sketch below).
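For The Death of Thread Per Core: a toy Python illustration of work stealing under skewed load, where each worker pops from its own deque but steals from a busy peer when idle. Real runtimes such as Tokio do this with lock-free deques and careful wakeup logic; this sketch only shows the balancing behavior.

```python
# Each worker owns a deque: LIFO pops from its own end, FIFO steals
# from a peer's opposite end when its own queue runs dry.
import random
import threading
from collections import deque

QUEUES = [deque() for _ in range(4)]
LOCK = threading.Lock()  # coarse lock, for clarity only

def worker(my_id: int, results: list):
    while True:
        with LOCK:
            if QUEUES[my_id]:
                task = QUEUES[my_id].pop()
            else:
                victims = [q for q in QUEUES if q]
                if not victims:
                    return  # everything drained
                task = random.choice(victims).popleft()  # steal
        results.append((my_id, task()))  # run outside the lock

# Skewed load: all tasks land on queue 0, stealing rebalances it.
for i in range(100):
    QUEUES[0].append(lambda i=i: i * i)

results: list = []
threads = [threading.Thread(target=worker, args=(w, results)) for w in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(results), "tasks done by", len({w for w, _ in results}), "worker(s)")
```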
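Since TonicDB is billed as Redis-compatible, an ordinary Redis client should talk to it unchanged. A quick sketch with redis-py; the host is a placeholder for wherever a TonicDB instance is actually deployed.

```python
# Redis-compatible means standard clients and commands work as-is.
import redis

r = redis.Redis(host="tonicdb.internal", port=6379, decode_responses=True)
r.set("session:42", "active", ex=300)  # SET with a 300-second TTL
print(r.get("session:42"))             # -> "active"
```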
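The Flink watermark takeaways translate directly into a watermark strategy. A PyFlink sketch, assuming records arrive as (key, event_time_millis, payload) tuples: assign timestamps early, bound the expected out-of-orderness, and mark idleness so a quiet partition cannot stall downstream windows.

```python
# WatermarkStrategy tuned to the article's advice: bounded lateness,
# explicit timestamp extraction, and idleness handling.
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner

class EventTimeAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        # Assumes records are (key, event_time_millis, payload) tuples.
        return value[1]

watermarks = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(10))  # tolerate 10s skew
    .with_timestamp_assigner(EventTimeAssigner())
    .with_idleness(Duration.of_minutes(1))  # idle partitions stop blocking
)
# Applied to a source via: stream.assign_timestamps_and_watermarks(watermarks)
```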