A live data app for $0: DuckDB, Astro, and no BI tool (8 minute read)
A $0 data app can work well when the stack is simple: open data, DuckDB transforms, Astro/Leaflet/SVG for the interface, GitHub Actions for refreshes, and existing static hosting. AI-assisted coding makes bespoke, on-brand data products cheaper and more flexible than BI tools when you do not need governance, shared metrics, or heavy analytics workflows.
|
Enabling Data Intelligence: Data Profiling Framework at Halodoc (10 minute read)
Halodoc built an Airflow-native data profiling framework to replace repeated ad hoc SQL profiling across hundreds of tables and multiple systems. It combines column-level profiling, join intelligence, and source-table analysis, running compute in Redshift or Athena and isolating each table in Kubernetes pods with run_id-based, idempotent staging writes. The result is a self-serve, searchable view of data quality and table relationships.
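To make the run_id-based idempotent staging write concrete, here is a minimal sketch. It is not Halodoc's implementation: SQLite stands in for Redshift or Athena, and the `profile_staging` table, its columns, and the `write_profile` helper are hypothetical names chosen for illustration. The idea shown is that a retried task first clears whatever it previously staged for the same table and run_id, so re-running never duplicates rows.

```python
import sqlite3


def make_conn():
    """Create an in-memory database with a hypothetical staging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profile_staging ("
        "table_name TEXT, run_id TEXT, column_name TEXT, null_fraction REAL)"
    )
    return conn


def write_profile(conn, table_name, run_id, metrics):
    """Stage column metrics for one (table_name, run_id) pair.

    Rows staged earlier for the same pair are deleted first, and the delete
    and insert commit in a single transaction, so a retry replaces rather
    than appends.
    """
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "DELETE FROM profile_staging WHERE table_name = ? AND run_id = ?",
            (table_name, run_id),
        )
        conn.executemany(
            "INSERT INTO profile_staging VALUES (?, ?, ?, ?)",
            [(table_name, run_id, col, frac) for col, frac in metrics.items()],
        )


def staged_rows(conn):
    """Count all staged rows (helper for inspection)."""
    return conn.execute("SELECT COUNT(*) FROM profile_staging").fetchone()[0]
```

Calling `write_profile` twice with the same run_id leaves the same rows in place, while a new run_id adds its own set alongside the old one.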
|
The Postgres Developer's Guide to Vector Index Tradeoffs (11 minute read)
Vector search in Postgres becomes an index-design problem once tables reach millions of vectors, filters enter the query path, and recall/latency tradeoffs start affecting product quality. Exact search is best for small datasets and recall baselines. HNSW is the default read-heavy ANN choice when data fits in memory, IVFFlat reduces memory and maintenance costs at the expense of more tuning, and StreamingDiskANN via pgvectorscale targets large indexes that outgrow RAM. Hybrid search with BM25 plus vectors in Postgres improves recall by combining semantic matching with keyword relevance.
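The role of exact search as a recall baseline can be sketched in a few lines of plain Python. This is an illustration only and does not use pgvector: the function names are hypothetical, cosine distance is an assumed metric, and the "approximate" result is simply whatever id list an ANN index returned. Recall at k is the share of the true top-k neighbors that the approximate result recovered.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def exact_top_k(query, vectors, k):
    """Brute-force scan: ids of the k vectors closest to the query."""
    ranked = sorted(vectors, key=lambda vid: cosine_distance(query, vectors[vid]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that also appears in the approximate result."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

In practice the exact list would come from a sequential scan over a sample of queries, and the approximate list from the index under test at a given tuning setting.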
|
|
Event-Driven vs. Polling Architectures for Agent Triggers (11 minute read)
Agent trigger architecture should be designed around delivery contracts, not a simplistic webhook-vs-polling choice. Webhooks are usually at-least-once, unordered, and best-effort. Polling can blow through rate limits. CDC and message buses offer stronger replay and durability, but still require idempotent handling. Mature agent systems typically combine fast-path events, reconciliation polling or replay, structural idempotency keys, and durable runtimes so long-running agents can survive duplicates, missed events, retries, and external waits.
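One way to read "structural idempotency keys" is a key derived from stable fields of the event rather than from the delivery attempt. The sketch below follows that assumption: the event fields (`source`, `object_id`, `version`), the `IdempotentHandler` class, and the in-memory `seen` set are all hypothetical, and a real system would keep the seen keys in a durable store.

```python
import hashlib
import json


class IdempotentHandler:
    """Run an action at most once per logical event under at-least-once delivery."""

    def __init__(self, action):
        self.action = action
        self.seen = set()  # stand-in for a durable key store

    @staticmethod
    def key_for(event):
        # Build the key from fields that identify the change itself, so a
        # redelivery with a different delivery id maps to the same key.
        stable = {
            "source": event["source"],
            "object_id": event["object_id"],
            "version": event["version"],
        }
        payload = json.dumps(stable, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def handle(self, event):
        """Return True if the action ran, False if the event was a duplicate."""
        key = self.key_for(event)
        if key in self.seen:
            return False
        self.action(event)
        # The key is recorded after the action, so a crash in between would
        # cause a re-run; the action itself should tolerate that.
        self.seen.add(key)
        return True
```

The same handler can sit behind both the webhook fast path and a reconciliation poll, since either source produces the same key for the same change.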
|
SQLite is All You Need for Durable Workflows (4 minute read)
Durable AI workflows can use local SQLite plus Litestream backups instead of heavier orchestration or database infrastructure. The payoff is simple, cheap, inspectable state for agents. The tradeoff is that workloads needing high availability or shared scalability are still better served by Postgres.
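A minimal sketch of the idea, assuming a workflow is a sequence of named steps whose results are JSON-serializable: each completed step is recorded in SQLite, and a restarted workflow reads the stored result instead of re-running the step. The `DurableWorkflow` class and `steps` table are hypothetical and not taken from the article. Litestream would replicate the database file from outside the process, so it does not appear in the code.

```python
import json
import sqlite3


class DurableWorkflow:
    """Checkpoint step results in SQLite so a restart can resume."""

    def __init__(self, conn, workflow_id):
        self.conn = conn
        self.workflow_id = workflow_id
        conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step_name TEXT, result TEXT, "
            "PRIMARY KEY (workflow_id, step_name))"
        )

    def step(self, name, fn):
        """Return the stored result for this step, or run fn and store it."""
        row = self.conn.execute(
            "SELECT result FROM steps WHERE workflow_id = ? AND step_name = ?",
            (self.workflow_id, name),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # step already completed earlier
        result = fn()
        with self.conn:  # commit the checkpoint atomically
            self.conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?)",
                (self.workflow_id, name, json.dumps(result)),
            )
        return result
```

With a file-backed connection such as `sqlite3.connect("workflow.db")`, the state can also be inspected with any SQLite client, which is the inspectability the summary refers to.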
|
MOR Isn't a Storage Optimization. It's an Architectural Shift (11 minute read)
Instead of synchronously rewriting entire files on every mutation (Copy-On-Write), MOR (Merge-On-Read) appends changes to log files and defers the expensive merge/compaction work to a background process, effectively time-shifting optimization from write time to a separate, controllable schedule. This design better supports high-frequency streaming updates and CDC workloads, though it introduces tradeoffs in read amplification and compaction management.
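The contrast can be shown with a toy in-memory model. This is a sketch, not any table format's actual implementation: a dict stands in for a base file, a list stands in for the log files, and the class names and `rewrites` counter are hypothetical. Copy-On-Write pays a full rewrite per mutation, while Merge-On-Read appends cheaply, merges when reading, and rewrites only when compaction runs.

```python
class CopyOnWriteTable:
    """Every upsert rewrites the whole base 'file'."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.rewrites = 0

    def upsert(self, key, value):
        new_base = dict(self.base)  # copy the full file
        new_base[key] = value
        self.base = new_base
        self.rewrites += 1

    def read(self):
        return dict(self.base)


class MergeOnReadTable:
    """Upserts append to a log; reads merge base and log; compaction rewrites."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.log = []
        self.rewrites = 0

    def upsert(self, key, value):
        self.log.append((key, value))  # cheap append, no rewrite

    def read(self):
        merged = dict(self.base)
        for key, value in self.log:  # later entries win
            merged[key] = value
        return merged

    def compact(self):
        # Background work in a real system: fold the log into a new base.
        self.base = self.read()
        self.log = []
        self.rewrites += 1
```

The read path in the second class grows with the length of the log, which is the read amplification the summary mentions, and `compact` is what resets it.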
|
|
ktx (GitHub Repo)
ktx is a local context layer that helps data agents query warehouses more accurately by combining approved metrics, join logic, warehouse metadata, and company knowledge into one searchable surface. It is aimed at teams that want Claude, Codex, Cursor, or other agents to reuse trusted definitions instead of inventing SQL from scratch.
|
Apache Iceberg 1.11.0 Adds registerView: Closing a Catalog Migration Gap (4 minute read)
Apache Iceberg 1.11.0 adds ‘registerView’, a metadata-preserving migration primitive that lets catalogs register existing Iceberg views from metadata files instead of recreating them from SQL. The release also adds a dedicated REST Catalog endpoint enabling cleaner authorization, capability signaling, and backward compatibility. This closes a migration gap for catalog-to-catalog moves, DR workflows, blue-green catalog upgrades, and tools like the Apache Polaris Iceberg Catalog Migrator.
|
|
The best of CPDP 2026 (14 minute read)
Computers, Privacy, and Data Protection 2026 highlighted the regulatory pressure points shaping data governance and AI: age-gating, biometric age verification, health data, children's digital rights, AI chatbot privacy, and the widening gap between formal compliance and real-world enforcement. Panels emphasized concrete risks like biometric processing limits, 230 million weekly health-related ChatGPT queries, and the need for PETs, transparency, and stronger controls over platform work, content moderation, and generative AI use.
|
How we built a lab to evaluate data agents (22 minute read)
Hex built Shoebox, an internal eval “lab bench” for data agents, so teams can compare candidate runs against stable production baselines and judge improvements across prompts, models, memory, search, and workspace context. They also created Shorelane Commerce, a realistic fake business with messy warehouse data, because simple text-to-SQL benchmarks do not reflect the ambiguity, context, and data debt real analytics agents must handle.