TLDR

TLDR Data 2026-06-11

Webinar: Unlocking first-party data for AI (Sponsor)

Join CEO Ben Brook on June 16 for a practical roadmap to first-party data activation and AI governance, plus a maturity diagnostic you can apply immediately.

> It starts with a real-time source of truth. Transcend encodes complete data-use permissions directly into the systems that process customer data, so every AI initiative and data product runs on a real-time source of truth. See how.

> 220+ enterprise IT and business leaders on why AI initiatives fail, and what to do about it.

Get the report →

📱

Deep Dives

Scaling Beyond One: How Airbnb Evolved Its Data Architecture for a Multi-product World (9 minute read)

Airbnb evolved its offline data architecture for a multi-product world with a flexible modeling framework that balances shared consistency with domain-specific needs. Its three principles are no hybrid models, consistent identifier naming, and clear namespaces so teams can separate product-specific models from cross-cutting monolithic ones.

Inside QuestDB's Query Engine: Tracing Three Queries (8 minute read)

QuestDB's time-series query engine appears tuple-at-a-time externally, but internally mixes vectorized execution, SIMD C++ kernels, Java batch processing, JIT filtering, and frame-based parallelism. Small SQL changes can shift execution paths, affecting group-by, filtering, and aggregation performance.

Parenting Iceberg and Lance with Gravitino: The Reality Behind Unified Lakehouse Architectures (8 minute read)

Apache Gravitino can govern Iceberg tables and Lance multimodal datasets through one metadata layer, RBAC model, and audit surface. Iceberg commits through the catalog, while Lance uses a two-step object-storage flow, with gotchas around config rewrites, jars, enum casing, and client drift.

🚀

Opinions & Advice

We had to build new evals for Fable (8 minute read)

Claude Fable 5 is a major step up for complex data analysis, scoring roughly 10-15% better than recent frontier models on Hex's evals and excelling at messy, long-horizon tasks that require judgment, clear assumptions, and cross-checking semantic models against raw data.

Dagster price increase 10x insane, don't ever use them (Reddit Thread)

Dagster's managed pricing jump has triggered backlash, pushing smaller users toward self-hosting, Airflow, Prefect, or simpler cron-style setups, though many still value Dagster OSS.

Why Metadata Has to Be Mutation-Friendly (10 minute read)

In high-update lakehouses, metadata becomes a high-mutation system. Apache Hudi's Merge-On-Read Metadata Table handles this with append-first writes and deferred compaction, reducing write cost and supporting scalable indexing more efficiently than Copy-On-Write designs.
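To make the append-first idea concrete, here is a minimal toy sketch in Python. It is not Hudi's implementation or API: the `MergeOnReadTable` class and its method names are hypothetical, and it only illustrates why writes stay cheap when merging is deferred to reads and compaction.

```python
class MergeOnReadTable:
    """Toy key-value table (hypothetical, not Hudi code)."""

    def __init__(self):
        self.base = {}  # compacted snapshot
        self.log = []   # appended (key, value) updates not yet compacted

    def upsert(self, key, value):
        # Append-first write: constant cost, the base snapshot is not rewritten.
        self.log.append((key, value))

    def get(self, key):
        # Merge on read: the newest log entry wins, otherwise use the base.
        for logged_key, logged_value in reversed(self.log):
            if logged_key == key:
                return logged_value
        return self.base.get(key)

    def compact(self):
        # Deferred compaction: fold the whole log into the base in one pass.
        for logged_key, logged_value in self.log:
            self.base[logged_key] = logged_value
        self.log.clear()


table = MergeOnReadTable()
table.upsert("file_1", {"rows": 100})
table.upsert("file_1", {"rows": 120})
latest_before = table.get("file_1")  # served from the log
table.compact()
latest_after = table.get("file_1")   # served from the base
```

A copy-on-write design would instead rewrite the snapshot on every `upsert`, which is the cost the article argues against for high-mutation metadata.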

When Event Time Meets Reality: Lessons from Building Billing on Apache Flink (12 minute read)

While building their usage-based billing pipeline, Gorgias experienced overlapping windows and incorrect aggregations during historical reprocessing due to internal repartitioning and uneven operator behavior that broke event-time guarantees. The team mitigated this by aligning keys across pipeline steps and applying conditional extra delays only during replays.

💻

Launches & Tools

PostgreSQL Anonymizer 3.1: Introducing Local Differential Privacy (2 minute read)

PostgreSQL Anonymizer 3.1 adds expanded masking for PII and sensitive data, with six masking strategies: substitution, randomization, pseudonymization, shuffling, noise addition, and generalization. It now supports Local Differential Privacy via GRRM, providing formal privacy guarantees for survey and categorical data, with the privacy level controlled by epsilon.
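For readers new to local differential privacy, the sketch below shows k-ary randomized response, a standard textbook mechanism for categorical data. It is an illustration of how epsilon trades accuracy for privacy, not the extension's code or API; the function names are hypothetical, and whether the release uses exactly this mechanism is an assumption.

```python
import math
import random


def truth_probability(k, epsilon):
    """Probability of reporting the true category among k options."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_response(true_value, categories, epsilon, rng=random):
    """Report the true category with probability p, else another one uniformly."""
    p_truth = truth_probability(len(categories), epsilon)
    if rng.random() < p_truth:
        return true_value
    others = [c for c in categories if c != true_value]
    return rng.choice(others)


# Smaller epsilon means more noise: at epsilon = 0 every answer is equally likely.
coin_flip = truth_probability(2, 0.0)
noisy_answer = randomized_response("blue", ["red", "green", "blue"], epsilon=1.0)
```

Each respondent perturbs their own answer before it is stored, so the collector never sees the raw value; aggregate frequencies can still be estimated because the noise distribution is known.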

Introducing Loon: A New Storage Engine for Vector Data That Never Stops Changing (19 minute read)

Vector datasets evolve through backfills, embedding versions, and mixed workloads, not just vector columns. Loon, behind Milvus 3.0 beta and Zilliz Vector Lakebase, uses hybrid file formats, row-ID alignment, and versioned manifests so scalars, vectors, and object references can update independently with less rewriting.

Introducing Streamling: Performant and Extensible Data Streaming Runtime (7 minute read)

Streamling is an open-source Rust, Arrow, and DataFusion streaming runtime for transactional workloads rather than heavy analytics. It runs mostly single-node stateless pipelines with Kafka, Postgres, ClickHouse, HTTP enrichment, TypeScript/WASM transforms, plugins, checkpointing, and effectively-once delivery.

🎁

Miscellaneous

Scaling Zero Copy from 1 Trillion to 120 Trillion Rows with File Federation (5 minute read)

Zero Copy at Salesforce Data 360 evolved from Query Federation to Iceberg File Federation to support AI workloads across distributed enterprise data without centralizing it. The new architecture reduces cross-system compute overhead, preserves governance through temporary catalog-based access, and is driven by the need for real-time AI across major data platforms.

DataAgents: How we turned 9 months of analysis into 10 days (6 minute read)

Capital One's DataAgent pattern cut cloud dormancy analysis across about 350 AWS, Azure, and GCP resource types from 6-9 months to 10 days. It combines asset data, AI-generated Spark SQL, confidence scoring, false-positive checks, and human validation to find high-confidence savings opportunities.

⚡

Quick Links

HNSW vs. LSH: How Elasticsearch Hits 0.99 recall@10 at 15,000 QPS — and What It Costs (9 minute read)

Exact vector search fails at scale because of high dimensionality, making HNSW the dominant approach in Elasticsearch.
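The recall@10 figure in the headline compares an approximate index against exact search. A minimal sketch of that metric, assuming result lists of document IDs ordered by rank (the function name and sample IDs are illustrative, not taken from the article):

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one result")
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


exact = list(range(10))              # ground truth from brute-force search
approx = list(range(9)) + [42]       # approximate search missed one neighbor
score = recall_at_k(approx, exact, k=10)
```

Here the approximate result recovers 9 of the 10 true neighbors, so `score` is 0.9.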

Why we shrank our TimescaleDB chunks from 30 days to 7 (4 minute read)

Warner Music Group reduced TimescaleDB chunk intervals from 30 days to 7 days on high-ingest hypertables after larger chunks caused compression failures, slower recent queries, and costly backfills.

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here, create your own role or send a friend's resume to jobs@tldr.tech and get $1k if we hire them! TLDR is one of Inc.'s Best Bootstrapped businesses of 2025.

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Fable Evals Performance ✅, Airbnb’s Evolved Data Architecture 🏘️, PostgreSQL Differential Privacy 🎭
