How Klaviyo Built an Unbreakable System for Running Distributed ML Workloads (11 minute read)
Klaviyo's DART Jobs API enables seamless, distributed ML workload orchestration across multi-cluster Kubernetes infrastructures. Leveraging Ray for scalable Python and AI execution, the architecture decouples coordination from execution via a central MySQL-backed state machine and sync services in each cluster, ensuring consistent job state, robust error handling, and strict resource isolation between development and production. Developers benefit from reproducible, rapid iteration with Python and CLI interfaces, while automated infrastructure management abstracts away complexity.

How Spotify Built Its Data Platform To Understand 1.4 Trillion Data Points (8 minute read)
Driven by needs for reliability, privacy, and efficiency amid massive growth, Spotify evolved from a single-team Hadoop cluster to a scalable, self-service, cloud-native platform leveraging Kubernetes Operators for automation, built-in privacy safeguards, and Backstage-powered observability with alerts. This new data platform captures 1.4T daily events and runs over 38,000 pipelines.

Deep Dive Into Hudi's Indexing Subsystem (11 minute read)
Apache Hudi's metadata table uses specialized indexing to optimize complex query patterns beyond basic data skipping, including record and secondary indexes for exact file location lookups on equality predicates, expression indexes (column stats or bloom filters) for optimizing queries with inline transformations, and async indexing in the background without blocking reads or writes.

The Case Against pgvector (13 minute read)
While pgvector makes vector search appear easy by extending PostgreSQL, in production, it has serious gaps: index types (IVFFlat and HNSW) require manual tuning and use a lot of memory, real-time inserts lead to build/rebuild bottlenecks, filtered queries suffer from planner mismatches, and hybrid search needs DIY integration. While pgvector works, it comes at the cost of operational complexity. For most teams, a dedicated vector DB may be the simpler, more reliable route.

Colocating Input Partitions with Kafka Streams When Consuming Multiple Topics: Sub-Topology Matters! (4 minute read)
When consuming two identically partitioned Kafka topics, Kafka Streams may assign same-index partitions to different instances if processed in separate sub-topologies, breaking local cache reuse and triggering duplicate API calls for identical keys. Unify sub-topologies using a shared state store to enforce partition colocation and enable efficient cross-topic coordination without joins, as topology design is critical to distributed behavior in Kafka Streams.

Data Warehouse, Data Lake, Data Lakehouse, Data Mesh: What They Are and How They Differ (15 minute read)
Data Warehouses offer fast, governed analytics on structured data, but are rigid and costly, while Data Lakes enable cheap storage of any data type for ML and exploration, but risk becoming ungoverned swamps. Data Lakehouses unify both with ACID and open formats for mixed workloads, while Data Mesh decentralizes ownership as data products and is ideal for large, mature organizations with strong self-service infrastructure.

What's New in Dash 3.3.0 (7 minute read)
Dash 3.3.0 brings major upgrades for day-to-day data app development: fully customizable developer tools, optional and hidden callbacks for cleaner logic, and a new Patch API for fast, surgical client-side figure updates without full re-renders. You can build and share your own dev-tool plugins, profile callbacks, and integrate custom React components directly into the debugging workflow. The release also adds Python 3.14 support and encourages migration from Dash Table to Dash Ag Grid for richer, more performant data grids.

The Delta Join in Apache Flink: Architectural Decoupling for Hyper-Scale Stream Processing (20 minute read)
Delta Join (FLIP-486) decouples state from streaming joins by routing lookups to external storage rather than retaining all historical data in Flink's checkpointed state. This enables production users to eliminate terabytes of join state, cut compute costs by an order of magnitude, and dramatically increase recovery speed. The trade-off: you accept external lookup latency in exchange for scalability and operational resilience.

BigQuery Under the Hood: How Google Brought Embeddings to Analytics (5 minute read)
Google's BigQuery vector search democratizes AI-driven similarity searches by natively embedding vector capabilities into its serverless data platform, allowing users to generate, index, and query embeddings. BigQuery vector search is now enhanced with TreeAH (ScaNN-based) for high-throughput batch tasks, async index training, stored columns for prefiltering, and partitioned indexes to skip irrelevant data.

Do You Really Need GraphRAG? A Practitioner's Guide Beyond the Hype (15 minute read)
GraphRAG adds entity and relation-aware reasoning on top of traditional RAG, unlocking cross-document queries, explainability, and massive search-space reduction. However, it comes at a real cost in complexity. GraphRAG is justified for long or relational documents (e.g., investigations and medical cases) but is overkill for independent texts. Start with a star-graph schema, expand iteratively, use graphs as classifiers, not responders, and control semantic fallbacks to avoid hallucinated links.

PostgreSQL 18: More Performance with Index Skip Scans (3 minute read)
PostgreSQL 18's index skip scans automatically optimize composite B-tree queries by skipping non-qualifying leading column values, jumping directly to the next group after processing matching rare values, delivering up to 100x speedups (66 ms → 0.6 ms) in low-cardinality scenarios.
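
The skip-scan idea in the PostgreSQL 18 item can be illustrated with a small Python sketch. This is not PostgreSQL's implementation: a sorted list of `(a, b)` tuples stands in for a composite B-tree on `(a, b)`, the function name `skip_scan` is made up for illustration, and the sketch assumes the second column is numeric so that `float("inf")` can act as an upper sentinel. It shows how a query that filters only on `b` can probe each distinct `a` group with a binary search and then jump to the next group, rather than reading every entry.

```python
from bisect import bisect_left, bisect_right


def skip_scan(index, b_value):
    """Return all (a, b) entries with b == b_value.

    `index` is a sorted list of (a, b) tuples, a stand-in for a composite
    B-tree on (a, b). Assumption: b is numeric, so float("inf") sorts
    after every real b value within a group.
    """
    matches = []
    pos = 0
    while pos < len(index):
        a = index[pos][0]  # current distinct value of the leading column
        # Probe inside this group for the requested b (binary search).
        lo = bisect_left(index, (a, b_value), pos)
        hi = bisect_right(index, (a, b_value), lo)
        matches.extend(index[lo:hi])
        # Skip the rest of the group: jump to the first entry with a larger a.
        pos = bisect_right(index, (a, float("inf")), hi)
    return matches


# Low-cardinality leading column: 3 regions, 1000 ids each.
index = sorted((region, n) for region in ("asia", "eu", "us") for n in range(1000))
print(skip_scan(index, 42))  # [('asia', 42), ('eu', 42), ('us', 42)]
```

With three distinct leading values, the loop runs three times and does a few binary searches per group instead of touching all 3,000 entries, which is why the technique pays off mainly when the leading column has few distinct values.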