Building a high-volume metrics pipeline with OpenTelemetry and vmagent (9 minute read)
Airbnb migrated a massive StatsD-based metrics pipeline to OpenTelemetry and Prometheus using a dual-write strategy: OTLP for internal services, Prometheus for OSS workloads, with StatsD left as a fallback. A shared metrics library enabled broad rollout, but the highest-volume services hit memory, GC, and heap regressions, which were mitigated by switching select workloads to delta temporality. A two-layer vmagent aggregation tier scales to hundreds of aggregators and ingests 100M+ samples/sec.
|
Building Biz Ask Anything: From Prototype to Product (14 minute read)
Yelp built its Business Attribute Assistant to automatically extract and standardize key attributes from millions of unstructured business text sources. Human-in-the-loop validation, confidence scoring, automated monitoring, and iterative model improvements keep its business listings accurate and up to date.
|
Building Hierarchical Agentic RAG Systems: Multi-Modal Reasoning with Autonomous Error Recovery (15 minute read)
Protocol-H, an open-source RAG framework, tackles the “modality gap” by using a hierarchical supervisor-worker architecture to combine SQL and vector search in multi-hop queries. On an internal EntQA benchmark, it significantly outperforms flat agents and standard RAG, at the cost of a higher p95 latency. The system adds deterministic orchestration, schema awareness, RBAC-aligned access, and autonomous retry/recovery for auditability and compliance.
|
|
Keeping a Postgres queue healthy (17 minute read)
Postgres job queues often degrade not because of raw throughput limits, but because dead tuples (row versions left behind by updates and deletes) pile up when vacuum cannot reclaim them: long-running transactions elsewhere in the database hold back the visibility horizon that vacuum needs. Over time this bloat adds hidden scan overhead and slows everything down. The fix is to control and limit that competing long-running traffic so vacuum can reclaim dead tuples promptly, keeping both the queue and the database stable.
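The dead-tuple dynamic above can be sketched as a small health check. The `pg_stat_user_tables` view and its `n_live_tup`/`n_dead_tup` columns are real Postgres statistics, but the query wrapper, the 20% threshold, and the helper function are illustrative assumptions, not code from the article.

```python
# Sketch of a dead-tuple health check for a queue table.
# pg_stat_user_tables and its n_live_tup / n_dead_tup columns are real
# Postgres statistics; the threshold is an illustrative guess.
DEAD_TUPLE_QUERY = """
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = %(table)s;
"""

def vacuum_falling_behind(n_live_tup: int, n_dead_tup: int,
                          max_dead_ratio: float = 0.2) -> bool:
    """Flag a table whose dead-tuple share suggests vacuum is blocked."""
    total = n_live_tup + n_dead_tup
    if total == 0:
        return False
    return n_dead_tup / total > max_dead_ratio

# A queue whose rows are mostly dead versions: vacuum is losing the race.
print(vacuum_falling_behind(n_live_tup=1_000, n_dead_tup=9_000))   # True
# A healthy queue with modest churn.
print(vacuum_falling_behind(n_live_tup=10_000, n_dead_tup=500))    # False
```

Running the query on a schedule and alerting on the ratio gives an early signal that competing transactions are blocking vacuum, before queue latency visibly degrades.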
|
Why Shannon Entropy Catches What Schema Validation Misses (8 minute read)
Traditional data checks can pass even when pipelines are semantically broken: schema, row count, null rates, and freshness don't detect distribution collapse, over-merging, or silent information loss. Shannon entropy can be used as a signal-integrity metric to monitor drift over time or information preservation across transformations.
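The entropy check described above fits in a few lines; the function and the example columns here are an illustrative sketch, not the article's implementation.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (bits) of a column's value distribution:
    H = sum over values of p * log2(1/p)."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * log2(n / c) for c in counts.values())

# A healthy categorical column: four values, uniformly spread -> 2.0 bits.
healthy = ["card", "cash", "wire", "check"] * 250
# After a buggy transform over-merges everything into one bucket, schema,
# row count, and null-rate checks all still pass, but entropy collapses.
collapsed = ["unknown"] * 1000

print(shannon_entropy(healthy))    # 2.0
print(shannon_entropy(collapsed))  # 0.0
```

Tracking this value per column across pipeline stages turns "silent information loss" into a numeric drop you can alert on.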
|
Your AI Dashboard Looks Cool. Nobody Learns Anything From It (9 minute read)
To build a useful dashboard with AI, start with a clear question to keep it focused (one main insight per view), match the chart type to the question type, and use descriptive names and comments so the AI understands your intent. Finally, make sure the dashboard tells a story instead of just displaying numbers.
|
|
Docglow (GitHub Repo)
Docglow is an open-source tool that turns dbt projects into a cleaner, interactive documentation experience with lineage, search, and AI chat, making data models easier to explore than standard dbt docs. It's a lightweight, self-hosted alternative to heavier data catalog tools, focused on usability for both technical and business users.
|
Opendataloader-pdf (GitHub Repo)
OpenDataLoader PDF is an open-source parser that turns PDFs into AI-ready Markdown, JSON (with bounding boxes), and HTML. It supports deterministic local extraction plus an AI hybrid mode for hard cases like scanned PDFs, complex tables, formulas, and charts.
|
kuva (GitHub Repo)
kuva is a Rust plotting library plus CLI for turning CSV/TSV data into publication-style visuals fast: 30 plot types, SVG by default, optional PNG/PDF output, and even terminal rendering. It's a lightweight way to add reproducible chart generation to pipelines, scripts, and shell workflows without hauling in a full Python plotting stack.
|
|
The broken economics of databases (9 minute read)
Database vendors look insanely profitable on gross margin, yet stay barely profitable because R&D and go-to-market costs are huge. As databases commoditize and hyperscalers own the infrastructure, vendors defend margins with differentiation, pricing opacity, and growing operational complexity. Net effect for data engineers: products often get more features but not simpler operations, because complexity itself helps preserve vendor economics.
|
|