pgvector’s Hidden Complexity ⚠️, Spotify’s Privacy-First Architecture 🔒, State-Driven ML Operations 🔄

TLDR

Together With Ahrefs

TLDR Data 2025-11-13

Ahrefs built the world's largest AI visibility database (Sponsor)

Brands are scrambling to understand the impact of AI on their marketing efforts. Ahrefs has built the data infrastructure that enables them to do so in an evidence-backed manner, rather than relying on semi-random sample prompts.

Brand Radar is built on the world's largest AI visibility database - powered by 6 massive indexes and tracking 150M+ total prompts. With no setup time or waiting, companies can plug into this database and understand what ChatGPT and other LLM tools are saying about their products and services.

Get access to Brand Radar >

📱

Deep Dives

How Klaviyo Built an Unbreakable System for Running Distributed ML Workloads (11 minute read)

Klaviyo's DART Jobs API enables seamless, distributed ML workload orchestration across multi-cluster Kubernetes infrastructures. Leveraging Ray for scalable Python and AI execution, the architecture decouples coordination from execution via a central MySQL-backed state machine and sync services in each cluster, ensuring consistent job state, robust error handling, and strict resource isolation between development and production. Developers benefit from reproducible, rapid iteration with Python and CLI interfaces, while automated infrastructure management abstracts away complexity.
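To make the "central state machine" idea concrete, here is a minimal sketch of a job lifecycle with validated transitions. The state names and the transition table are hypothetical and are not taken from Klaviyo's DART Jobs API; in the architecture described, the current state would live in a MySQL row rather than in memory.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical states; the real DART Jobs API may define others.
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed moves. Terminal states have no outgoing transitions.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING, JobState.CANCELLED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
    JobState.CANCELLED: set(),
}


class InvalidTransition(Exception):
    """Raised when a sync service reports a move the state machine forbids."""


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Rejecting illegal moves in one place is what keeps job state consistent when several per-cluster sync services report updates for the same job.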
How Spotify Built Its Data Platform To Understand 1.4 Trillion Data Points (8 minute read)

Driven by the need for reliability, privacy, and efficiency amid massive growth, Spotify evolved from a single-team Hadoop cluster to a scalable, self-service, cloud-native platform that uses Kubernetes Operators for automation, built-in privacy safeguards, and Backstage-powered observability with alerts. The new data platform captures 1.4 trillion events per day and runs over 38,000 pipelines.
Deep Dive Into Hudi's Indexing Subsystem (11 minute read)

Apache Hudi's metadata table uses specialized indexes to optimize query patterns beyond basic data skipping: record and secondary indexes give exact file locations for equality predicates, and expression indexes (column stats or bloom filters) speed up queries with inline transformations. Async indexing builds these indexes in the background without blocking reads or writes.
Meta's Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation (9 minute read)

Meta's Generative Ads Recommendation Model (GEM) is the industry's largest foundation model for recommendation systems (RecSys) and is trained at LLM-scale across thousands of GPUs. It introduces scalable architecture and post-training techniques to enhance ad personalization, delivering 5% higher conversions on Instagram and 3% on Facebook Feed.
🚀

Opinions & Advice

The Case Against pgvector (13 minute read)

Although pgvector makes vector search look easy by extending PostgreSQL, it has serious gaps in production: its index types (IVFFlat and HNSW) require manual tuning and use a lot of memory, real-time inserts create build and rebuild bottlenecks, filtered queries suffer from planner mismatches, and hybrid search needs DIY integration. pgvector works, but at the cost of operational complexity. For most teams, a dedicated vector DB may be the simpler, more reliable route.
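One way the filtered-query problem can show up is post-filtering: the index returns a fixed number of nearest candidates, and the filter is applied only afterwards. The sketch below is a toy model under stated assumptions, not pgvector itself: an exact sort stands in for the approximate index, embeddings are one-dimensional, and the `candidates` parameter plays the role of a setting such as `hnsw.ef_search`.

```python
def post_filtered_search(rows, query, candidates, limit, predicate):
    """Take the `candidates` nearest rows first, then apply the filter."""
    # Exact sort by distance stands in for an approximate index scan.
    nearest = sorted(rows, key=lambda r: abs(r["x"] - query))[:candidates]
    # The filter only sees what the index scan returned.
    return [r for r in nearest if predicate(r)][:limit]


# 1,000 rows, with the filter matching 10% of them (tenant 0).
rows = [{"id": i, "x": float(i), "tenant": i % 10} for i in range(1000)]
hits = post_filtered_search(
    rows, query=0.0, candidates=40, limit=10,
    predicate=lambda r: r["tenant"] == 0,
)
print(len(hits))  # 4, even though 10 rows were requested and 100 match
```

Raising the candidate count recovers the missing rows at the cost of a more expensive scan, which is the kind of manual tuning the article argues against.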
Colocating Input Partitions with Kafka Streams When Consuming Multiple Topics: Sub-Topology Matters! (4 minute read)

When consuming two identically partitioned Kafka topics, Kafka Streams may assign same-index partitions to different instances if the topics are processed in separate sub-topologies, which breaks local cache reuse and triggers duplicate API calls for identical keys. Unifying the sub-topologies with a shared state store enforces partition colocation and enables efficient cross-topic coordination without joins. Topology design is critical to distributed behavior in Kafka Streams.
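The colocation effect follows from how tasks are formed: Kafka Streams creates one task per sub-topology and partition, with IDs of the form `<subtopology>_<partition>`, and a task is the unit that gets assigned to an instance. The Python sketch below only models that task layout; it is not the Kafka Streams API, and the topic names are hypothetical.

```python
def build_tasks(sub_topologies, num_partitions):
    """Map task IDs to the (topic, partition) pairs each task consumes.

    sub_topologies: one list of source topic names per sub-topology.
    """
    tasks = {}
    for st_id, topics in enumerate(sub_topologies):
        for partition in range(num_partitions):
            tasks[f"{st_id}_{partition}"] = [(t, partition) for t in topics]
    return tasks


# Two sub-topologies: partition 0 of each topic lands in a different task,
# so the two tasks may run on different instances.
separate = build_tasks([["orders"], ["payments"]], num_partitions=2)

# One sub-topology: a single task owns partition 0 of both topics.
unified = build_tasks([["orders", "payments"]], num_partitions=2)
```

In `separate`, tasks `0_0` and `1_0` are scheduled independently, while in `unified` task `0_0` reads partition 0 of both topics and can therefore share a local cache.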
Data Warehouse, Data Lake, Data Lakehouse, Data Mesh: What They Are and How They Differ (15 minute read)

Data Warehouses offer fast, governed analytics on structured data, but are rigid and costly, while Data Lakes enable cheap storage of any data type for ML and exploration, but risk becoming ungoverned swamps. Data Lakehouses unify both with ACID and open formats for mixed workloads, while Data Mesh decentralizes ownership as data products and is ideal for large, mature organizations with strong self-service infrastructure.
💻

Launches & Tools

Data teams lose over $21,613 per analyst each year due to inefficient processes (Sponsor)

dbt Labs surveyed 510 analysts to see where analysts are actually spending their time (spoiler: less than 25% is spent on actual analysis). Learn why more than half of analysts use AI tools outside approved systems — and what they need to go from busywork to getting things done. If you are tired of spending more time fixing data than analyzing it, this report will show you what needs to change. Read the report
What's New in Dash 3.3.0 (7 minute read)

Dash 3.3.0 brings major upgrades for day-to-day data app development: fully customizable developer tools, optional and hidden callbacks for cleaner logic, and a new Patch API for fast, surgical client-side figure updates without full re-renders. You can build and share your own dev-tool plugins, profile callbacks, and integrate custom React components directly into the debugging workflow. The release also adds Python 3.14 support and encourages migration from Dash Table to Dash AG Grid for richer, more performant data grids.
The Delta Join in Apache Flink: Architectural Decoupling for Hyper-Scale Stream Processing (20 minute read)

Delta Join (FLIP-486) decouples state from streaming joins by routing lookups to external storage rather than retaining all historical data in Flink's checkpointed state. This enables production users to eliminate terabytes of join state, cut compute costs by an order of magnitude, and dramatically increase recovery speed. The trade-off: you accept external lookup latency in exchange for scalability and operational resilience.
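As a rough illustration of the idea, the sketch below joins each arriving row by looking up the opposite side in an external table instead of reading accumulated operator state. It is a toy model, not Flink code: `ExternalTable` is a hypothetical in-memory stand-in for the external store, and `on_change` writes the row itself only to simulate the row already being present upstream.

```python
class ExternalTable:
    """In-memory stand-in for an external, key-indexed store (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def write(self, key, row):
        self._rows.setdefault(key, []).append(row)

    def lookup(self, key):
        return list(self._rows.get(key, []))


def on_change(side, key, row, left_table, right_table):
    """Handle one changelog row by looking up the opposite table.

    The join operator keeps no history of its own; all rows live in the tables.
    """
    if side == "left":
        left_table.write(key, row)  # simulates the row already stored upstream
        return [(row, other) for other in right_table.lookup(key)]
    right_table.write(key, row)
    return [(other, row) for other in left_table.lookup(key)]


left, right = ExternalTable(), ExternalTable()
on_change("left", 1, "order-1", left, right)            # no match yet
print(on_change("right", 1, "payment-1", left, right))  # [('order-1', 'payment-1')]
```

Every match costs a lookup against the store, which is the latency trade-off the article describes.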
BigQuery Under the Hood: How Google Brought Embeddings to Analytics (5 minute read)

Google's BigQuery vector search democratizes AI-driven similarity searches by natively embedding vector capabilities into its serverless data platform, allowing users to generate, index, and query embeddings. BigQuery vector search is now enhanced with TreeAH (ScaNN-based) for high-throughput batch tasks, async index training, stored columns for prefiltering, and partitioned indexes to skip irrelevant data.
🎁

Miscellaneous

Do You Really Need GraphRAG? A Practitioner's Guide Beyond the Hype (15 minute read)

GraphRAG adds entity and relation-aware reasoning on top of traditional RAG, unlocking cross-document queries, explainability, and massive search-space reduction. However, it comes at a real cost in complexity. GraphRAG is justified for long or relational documents (e.g., investigations and medical cases) but is overkill for independent texts. Start with a star-graph schema, expand iteratively, use graphs as classifiers, not responders, and control semantic fallbacks to avoid hallucinated links.

Quick Links

What Does the End of GIL Mean for Python? (5 minute read)

Python is making the Global Interpreter Lock (GIL) optional through PEP 703, whose free-threaded build enables true parallel execution of threads across CPU cores.
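A small sketch of what this means in practice: the same threaded, CPU-bound code runs on any build, but only a free-threaded build can spread it across cores. The helper names here are illustrative. `sys._is_gil_enabled()` is available on CPython 3.13 and later; on older versions the sketch assumes the GIL is on.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit: int) -> int:
    """CPU-bound work: naive count of primes below `limit`."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def gil_enabled() -> bool:
    """Report whether the GIL is active; assume it is on older interpreters."""
    check = getattr(sys, "_is_gil_enabled", None)
    return check() if check is not None else True


def parallel_prime_counts(limits):
    """Run one thread per input; results come back in input order."""
    with ThreadPoolExecutor(max_workers=len(limits)) as pool:
        return list(pool.map(count_primes, limits))


print(gil_enabled())
print(parallel_prime_counts([10, 20, 30]))  # [4, 8, 10]
```

With the GIL enabled the threads take turns on one core, so the results are identical but the wall-clock time does not improve with more threads.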
PostgreSQL 18: More Performance with Index Skip Scans (3 minute read)

PostgreSQL 18's index skip scans automatically optimize queries on composite B-tree indexes that do not constrain the leading column: the scan visits each distinct leading value, seeks directly to the matching entries, and skips the rest of that group. This delivers up to 100x speedups (66 ms → 0.6 ms) in low-cardinality scenarios.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

