Viaduct 1.0 and the Future of Airbnb's Data Mesh (5 minute read)
Viaduct 1.0 is Airbnb's open-source data-oriented service mesh built on GraphQL. It provides a single unified schema for accessing any data source across the company. Teams contribute their own schema and resolvers as multi-tenant modules without operating separate GraphQL services, which strikes a balance between a monolithic GraphQL server and full federation.
|
AWS Outage May 2026: Lessons for Database Disaster Recovery (10 minute read)
A major AWS US-EAST-1 outage in May was triggered by a data center overheating event in a single availability zone, causing multi-hour disruptions for high-profile services like Coinbase. The incident highlighted the critical difference between Multi-AZ high availability (which failed to protect latency-sensitive workloads) and true cross-region disaster recovery.
|
|
Exploring schema evolution with ontology-driven propagation (4 minute read)
A plain-English ontology can act as a runtime access policy that survives schema evolution, letting an LLM classify columns one at a time using row counts, cardinality ratios, and sampled values. The approach keeps policy separate from pipeline code, but it does not cover sensitive inferences from numeric columns or cross-column re-identification.
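As a rough illustration of the per-column signals mentioned above, the sketch below computes a row count, a cardinality ratio, and a small value sample for one column, then folds them into a prompt string. The function names, the prompt wording, and the ontology text are all hypothetical and not taken from the article; the actual LLM call is left out.

```python
import random


def profile_column(values, sample_size=5, seed=0):
    """Summarize one column: row count, cardinality ratio, and sampled values."""
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    # Cardinality ratio: distinct non-null values divided by non-null values.
    ratio = distinct / len(non_null) if non_null else 0.0
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(non_null, min(sample_size, len(non_null)))
    return {
        "row_count": len(values),
        "distinct": distinct,
        "cardinality_ratio": ratio,
        "sample": sample,
    }


def build_prompt(column_name, profile, ontology_text):
    """Assemble a classification prompt (hypothetical wording) for one column."""
    return (
        f"Ontology:\n{ontology_text}\n\n"
        f"Column: {column_name}\n"
        f"Rows: {profile['row_count']}\n"
        f"Cardinality ratio: {profile['cardinality_ratio']:.2f}\n"
        f"Sample values: {profile['sample']}\n"
        "Which ontology category does this column belong to?"
    )


if __name__ == "__main__":
    emails = ["a@example.com", "b@example.com", "a@example.com", None]
    profile = profile_column(emails)
    print(build_prompt("contact", profile, "Email addresses are personal data."))
```

Because the profile is computed from the data rather than from the column name, a renamed column would produce the same signals, which is one way a policy could survive schema changes.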
|
The Modern Data Stack is Overcomplicated: Data Ingestion (17 minute read)
Data ingestion looks simple, but the wrong choice can create hidden costs through broken connectors, schema drift, over-engineering, and wasted engineering time. The best approach is usually a hybrid: managed connectors for standard SaaS, streaming only when low latency truly matters, and custom pipelines for niche or legacy sources.
|
Welcome to ORDER BY Jungle (11 minute read)
PostgreSQL resolves column names and expressions in ORDER BY clauses in inconsistent ways. For example, bare identifiers (e.g. ORDER BY a) first look for aliases in the SELECT list, while any expression (e.g. ORDER BY -a) resolves against the FROM clause, leading to confusing behaviors with aliases, quoting, GROUP BY, window functions, and UNION.
|
|
ducklake-sdk (GitHub Repo)
ducklake-sdk is an alpha Rust/Python SDK for reading and writing DuckLake tables without running DuckDB. It implements the DuckLake spec in a Rust core, with Python integrations for Polars, Arrow, and DuckDB, targeting SQL-catalog metadata plus Parquet storage. Useful for embedding DuckLake access into apps, pipelines, or engines directly.
|
Apache Arrow as Data Interchange (5 minute read)
Apache Arrow is rapidly becoming the universal in-memory columnar format for data interchange across the modern data stack. Instead of repeatedly serializing, deserializing, and copying data between tools (Pandas → Spark → databases, etc.), Arrow enables zero-copy handoff, where systems share the exact same memory layout, dramatically reducing CPU overhead.
|
What Matters in Production RAG (8 minute read)
Key requirements for production RAG include smart chunking strategies (recursive, semantic, and structure-aware), robust indexing pipelines with document registries, content hashing for efficient updates, alias-based zero-downtime index switching, careful embedding model management, and strong observability with detailed tracing, chunk attribution, and retrieval quality metrics.
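Two of the items above, content hashing for efficient updates and alias-based index switching, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the registry is a plain dict of document ID to hash, and `IndexAlias` is a hypothetical stand-in for whatever alias mechanism a real vector store or search engine provides.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(registry: dict, documents: dict):
    """Compare incoming documents with the registry of {doc_id: hash}.

    Returns (ids to re-embed and index, ids to delete). Unchanged
    documents appear in neither list, so they are never re-embedded.
    """
    to_index = [
        doc_id
        for doc_id, text in documents.items()
        if registry.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in registry if doc_id not in documents]
    return sorted(to_index), sorted(to_delete)


class IndexAlias:
    """Hypothetical alias: queries read `live`, rebuilds happen elsewhere."""

    def __init__(self, live: str):
        self.live = live

    def switch(self, new_index: str) -> str:
        """Point the alias at a fully built index; return the old name."""
        previous, self.live = self.live, new_index
        return previous


if __name__ == "__main__":
    registry = {"a": content_hash("hello"), "b": content_hash("old text")}
    incoming = {"a": "hello", "b": "new text", "c": "brand new"}
    print(plan_updates(registry, incoming))
    alias = IndexAlias("docs_v1")
    print(alias.switch("docs_v2"), "->", alias.live)
```

Keeping the previous index name around after the switch gives a simple rollback path if retrieval quality drops on the new index.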
|
|
Your AI agent deletes critical data: Who is responsible? (5 minute read)
AI agents that can write to production systems create a new accountability and recovery problem: a Replit agent once deleted a live database, and the real issue was the absence of clear ownership, guardrails, and rollback. With 86% of IT/security leaders expecting agents to outrun current controls, governance is a shared responsibility across architecture, security, legal, and business. Practical controls like policy boundaries, observability, human-in-the-loop triage, and explicit recovery mechanisms are essential to prevent autonomous tools from becoming enterprise-wide risk.
|
Context pruning: cut LLM tokens without losing quality (9 minute read)
Context pruning is the practice of selectively removing low-value tokens, sentences, or passages from an LLM's input to reduce cost and latency and, often, to improve output quality. It includes techniques such as token-level, sentence/chunk-level, attention-based, and dynamic layer-progressive pruning, and works best when paired with semantic caching.
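To make the sentence/chunk-level variant concrete, here is a minimal sketch that scores each passage by word overlap with the query and keeps only the top few. Lexical overlap is a deliberately simple stand-in chosen for this example; it is not the attention-based or layer-progressive scoring the article mentions, and the function names are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase word set; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prune_context(query: str, passages: list, keep: int = 2) -> list:
    """Keep the `keep` passages sharing the most words with the query.

    Surviving passages are returned in their original order so the
    pruned context still reads coherently.
    """
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p)), i) for i, p in enumerate(passages)]
    # Highest overlap first; ties go to the earlier passage.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    kept_indices = sorted(i for _, i in top)
    return [passages[i] for i in kept_indices]


if __name__ == "__main__":
    passages = [
        "Arrow enables zero-copy handoff between tools.",
        "The office cafeteria serves lunch at noon.",
        "Zero-copy sharing avoids serialization in Arrow.",
    ]
    for p in prune_context("How does Arrow enable zero-copy handoff?", passages):
        print(p)
```

In practice the overlap score would be replaced by an embedding similarity or a learned relevance model, but the keep-top-k-in-order structure stays the same.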
|