
Scaling Metrics at Airbnb 🏠, Automated dbt Docs 📚, Postgres Queue Pitfalls 🧹

TLDR Data 2026-04-13

📱

Deep Dives

Building a high-volume metrics pipeline with OpenTelemetry and vmagent (9 minute read)

Airbnb migrated a massive StatsD-based metrics pipeline to OpenTelemetry and Prometheus using a dual-write strategy: OTLP for internal services, Prometheus for OSS workloads, with StatsD left as a fallback. A shared metrics library enabled broad rollout, but the highest-volume services hit memory, GC, and heap regressions, which were mitigated by switching select workloads to delta temporality. A two-layer vmagent aggregation tier scales to hundreds of aggregators and ingests 100M+ samples/sec.
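The delta-temporality fix can be illustrated with a toy conversion: a cumulative counter stream is turned into per-interval deltas, so each export carries only the change since the last report instead of ever-growing cumulative state. This is a minimal sketch of the idea, not Airbnb's actual metrics library.

```python
def to_delta(cumulative_samples):
    """Convert a cumulative counter series to delta temporality.

    Each exported point carries only the increase since the previous
    report, which is why switching high-volume workloads to delta
    temporality relieves memory and GC pressure in the exporter.
    """
    deltas = []
    previous = 0
    for value in cumulative_samples:
        # A drop below the previous value signals a counter reset
        # (e.g. a process restart); report the raw value in that case.
        deltas.append(value - previous if value >= previous else value)
        previous = value
    return deltas

# A counter reported cumulatively at four scrape intervals:
print(to_delta([10, 25, 25, 40]))  # → [10, 15, 0, 15]
```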
Building Biz Ask Anything: From Prototype to Product (14 minute read)

Yelp built its Business Attribute Assistant to automatically extract and standardize key attributes from millions of unstructured business text sources. Human-in-the-loop validation, confidence scoring, automated monitoring, and iterative model improvements keep its business listings accurate and up to date.
Building Hierarchical Agentic RAG Systems: Multi-Modal Reasoning with Autonomous Error Recovery (15 minute read)

Protocol-H, an open-source RAG framework, tackles the “modality gap” by using a hierarchical supervisor-worker architecture to combine SQL and vector search in multi-hop queries. On an internal EntQA benchmark, it significantly outperforms flat agents and standard RAG, at the cost of a higher p95 latency. The system adds deterministic orchestration, schema awareness, RBAC-aligned access, and autonomous retry/recovery for auditability and compliance.
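The supervisor-worker pattern with retry can be sketched as a dispatcher that routes each sub-query to a SQL or vector worker and retries a failing worker before giving up. All names here are hypothetical; Protocol-H's real orchestration (schema awareness, RBAC, audit logging) is far richer.

```python
def supervisor(sub_queries, workers, max_retries=2):
    """Route each (modality, query) pair to its worker, retrying on failure.

    `workers` maps a modality name ("sql" or "vector") to a callable;
    the retry loop models the autonomous error recovery described above.
    """
    results = []
    for modality, query in sub_queries:
        worker = workers[modality]
        for attempt in range(max_retries + 1):
            try:
                results.append(worker(query))
                break
            except RuntimeError:
                if attempt == max_retries:
                    results.append(None)  # recovery exhausted
    return results

# Toy workers standing in for real SQL and vector-search backends;
# the SQL worker fails once to exercise the retry path.
attempts = {"count": 0}
def flaky_sql(q):
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")
    return f"sql:{q}"

workers = {"sql": flaky_sql, "vector": lambda q: f"vec:{q}"}
plan = [("sql", "revenue by region"), ("vector", "related incidents")]
print(supervisor(plan, workers))  # → ['sql:revenue by region', 'vec:related incidents']
```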
🚀

Opinions & Advice

Keeping a Postgres queue healthy (17 minute read)

Postgres job queues often degrade not because of performance limits, but because deleted rows (“dead tuples”) pile up when cleanup (vacuuming) is blocked by long or overlapping queries from other workloads. Over time, this creates hidden overhead and slows everything down. The solution is to control and limit competing query traffic so vacuuming can run effectively, keeping the queue and database stable.
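The mechanism can be modeled in a few lines: vacuum may only reclaim a dead tuple that died before the snapshot of the oldest still-open transaction (the "xmin horizon"), so a single long-running query pins every later deletion. This is a toy model of the horizon rule, not Postgres internals.

```python
def reclaimable(dead_tuples, open_transaction_starts):
    """Return the dead tuples vacuum can reclaim right now.

    Each dead tuple is identified by the transaction id that deleted it.
    A tuple is reclaimable only if it died before the oldest transaction
    still holding a snapshot -- otherwise that transaction might still
    need to see the old row version.
    """
    horizon = min(open_transaction_starts, default=float("inf"))
    return [txid for txid in dead_tuples if txid < horizon]

# Queue rows deleted by transactions 100..104, while an analytics
# query that started at txid 102 is still running:
dead = [100, 101, 102, 103, 104]
print(reclaimable(dead, open_transaction_starts=[102]))  # → [100, 101]
# Once the long query finishes, everything becomes reclaimable:
print(reclaimable(dead, open_transaction_starts=[]))     # → [100, 101, 102, 103, 104]
```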
Why Shannon Entropy Catches What Schema Validation Misses (8 minute read)

Traditional data checks can pass even when pipelines are semantically broken: schema, row count, null rates, and freshness don't detect distribution collapse, over-merging, or silent information loss. Shannon entropy can be used as a signal-integrity metric to monitor drift over time or information preservation across transformations.
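The idea can be demonstrated in a few lines: two columns with identical schema, row count, and null rate can have wildly different entropy, and a collapse toward one dominant value shows up immediately. A minimal sketch using only the standard library:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of a column's value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A healthy categorical column vs. one that silently collapsed
# (e.g. a broken join defaulting almost everything to "unknown").
# Both pass schema, row-count, and null-rate checks.
healthy = ["a", "b", "c", "d"] * 25        # 4 evenly used categories
collapsed = ["unknown"] * 99 + ["a"]       # one value dominates

print(round(shannon_entropy(healthy), 2))    # → 2.0
print(round(shannon_entropy(collapsed), 2))  # → 0.08
```

Monitoring this number per column over time (or comparing it before and after a transformation) flags the information loss that the structural checks miss.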
Your AI Dashboard Looks Cool. Nobody Learns Anything From It (9 minute read)

To build a useful dashboard with AI, start with a clear question to keep it focused (one main insight per view), match the chart type to the question type, and use descriptive names and comments so the AI can understand your intent. Finally, make sure your AI dashboard tells a story instead of just displaying numbers.
💻

Launches & Tools

Docglow (GitHub Repo)

Docglow is an open-source tool that turns dbt projects into a cleaner, interactive documentation experience with lineage, search, and AI chat, making data models easier to explore than standard dbt docs. It's a lightweight, self-hosted alternative to heavier data catalog tools, focused on usability for both technical and business users.
Opendataloader-pdf (GitHub Repo)

OpenDataLoader PDF is an open-source parser that turns PDFs into AI-ready Markdown, JSON (with bounding boxes), and HTML. It supports deterministic local extraction plus an AI hybrid mode for hard cases like scanned PDFs, complex tables, formulas, and charts.
kuva (GitHub Repo)

kuva is a Rust plotting library plus CLI for turning CSV/TSV data into publication-style visuals fast: 30 plot types, SVG by default, optional PNG/PDF output, and even terminal rendering. It's a lightweight way to add reproducible chart generation to pipelines, scripts, and shell workflows without hauling in a full Python plotting stack.
Deep Dive into Apache Iceberg Architecture: The Three Layers That Power Your Lakehouse (9 minute read)

Apache Iceberg's architecture powers modern lakehouses with three distinct layers: the Catalog Layer manages metadata pointers and atomic commits; the Metadata Layer stores immutable files with schema, partition, and snapshot history; and the Data Layer holds the actual Parquet data files. Together, these layers enable ACID transactions, schema evolution, time travel, and efficient querying at massive scale.
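The three-layer pointer chain can be sketched as nested references: a commit writes new immutable metadata and data files, then atomically swaps the single catalog pointer. This is a deliberately simplified model; real Iceberg metadata also carries manifests, schemas, and partition specs.

```python
# Simplified model of Iceberg's layers: the catalog holds one mutable
# pointer; metadata and data files are immutable and append-only.
catalog = {"table.current_metadata": "meta-v1.json"}
metadata_files = {"meta-v1.json": {"snapshot": 1, "data_files": ["a.parquet"]}}
data_files = {"a.parquet": [1, 2, 3]}

def commit(new_rows):
    """Append data, write new metadata, then atomically swap the pointer."""
    old_meta = metadata_files[catalog["table.current_metadata"]]
    new_data = f"file-{len(data_files)}.parquet"
    data_files[new_data] = new_rows                       # data layer
    new_meta_name = f"meta-v{old_meta['snapshot'] + 1}.json"
    metadata_files[new_meta_name] = {                     # metadata layer
        "snapshot": old_meta["snapshot"] + 1,
        "data_files": old_meta["data_files"] + [new_data],
    }
    catalog["table.current_metadata"] = new_meta_name     # atomic catalog swap

commit([4, 5])
current = metadata_files[catalog["table.current_metadata"]]
print(current["data_files"])  # → ['a.parquet', 'file-1.parquet']
# The old metadata file survives untouched, which is what makes
# time travel back to snapshot 1 possible.
```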
🎁

Miscellaneous

Market Leaders vs Challengers: the Ongoing Battle for Data Catalogs in Data Lakehouse (5 minute read)

Data catalogs are becoming the control layer for lakehouses, managing governance, access, and interoperability across data stacks. Managed options are easier but lock you in, while open-source tools offer flexibility and multi-engine support at the cost of maturity, so many teams will need both a technical and business-facing catalog.
The broken economics of databases (9 minute read)

Database vendors look insanely profitable on gross margin, yet stay barely profitable because R&D and go-to-market costs are huge. As databases commoditize and hyperscalers own the infrastructure, vendors defend margins with differentiation, pricing opacity, and growing operational complexity. Net effect for data engineers: products often get more features but not simpler operations, because complexity itself helps preserve vendor economics.

Quick Links

Streaming sort-merge joins in Polars (6 minute read)

Polars' new streaming merge join cuts join time by up to 18× by skipping the hash-table build when keys are presorted.
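The speedup comes from the classic merge-join loop: when both inputs are already sorted on the key, two cursors advance in lockstep and no hash table is ever built. A generic sketch (unique keys assumed), not Polars' Rust implementation:

```python
def merge_join(left, right):
    """Inner-join two key-sorted lists of (key, value) pairs.

    Runs in O(n + m) with no hash build, which is why presorted
    inputs make this so much cheaper than a hash join.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance the cursor with the smaller key
        elif lk > rk:
            j += 1
        else:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

left = [(1, "a"), (2, "b"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(merge_join(left, right))  # → [(2, 'b', 'x'), (4, 'd', 'z')]
```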
Your harness, Your memory (8 minute read)

Agent harnesses control how memory works, so if you use a closed or API-based harness, you don't really own your agent's memory.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here, create your own role or send a friend's resume to jobs@tldr.tech and get $1k if we hire them! TLDR is one of Inc.'s Best Bootstrapped businesses of 2025.

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.
