Snowflake vs Databricks’ Reyden 🥊, Coding Agent Discipline 🛠️, Apache Flink 2.3.0 🌊

TLDR Data 2026-06-29

📱

Deep Dives

12TB of AI Coding Agent Logs (17 minute video)

AI coding is shifting from token maxing to token efficiency as teams move from subscriptions to per-token billing and costs become harder to control. Better workflows rely on careful upfront planning, right-sized agent sessions, cleaner context, API-first tooling, strong CI, and focused human review.

Automated Schema Evolution in Pinterest's Next-Generation DB Ingestion Framework (11 minute read)

Pinterest built schema evolution for CDC across Kafka, Flink, Spark, and Iceberg, treating schema as a contract. Source schemas and sink mappings generate Flink/Spark/Iceberg artifacts, while push- and pull-based checks detect drift. Changes roll out with PR auditability, SLA-based recovery, and backfill fallbacks.
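
To make the drift check concrete, here is a minimal Python sketch that compares the schema a pipeline expects (the contract) with the schema a source currently reports. The `detect_drift` helper, the `{column: type}` schema shape, and the column names are hypothetical illustrations, not part of Pinterest's framework.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    """Differences between an expected (contract) schema and an observed one."""
    added: dict = field(default_factory=dict)    # column -> observed type
    removed: dict = field(default_factory=dict)  # column -> expected type
    retyped: dict = field(default_factory=dict)  # column -> (expected, observed)

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def detect_drift(expected: dict, observed: dict) -> SchemaDrift:
    """Compare two {column_name: type_name} mappings (hypothetical schema shape)."""
    drift = SchemaDrift()
    for column, observed_type in observed.items():
        if column not in expected:
            drift.added[column] = observed_type
        elif expected[column] != observed_type:
            drift.retyped[column] = (expected[column], observed_type)
    for column, expected_type in expected.items():
        if column not in observed:
            drift.removed[column] = expected_type
    return drift


# Illustrative schemas only: the source gained a column and changed a type.
contract = {"id": "bigint", "email": "string", "created_at": "timestamp"}
source = {"id": "bigint", "email": "string", "created_at": "string", "country": "string"}

drift = detect_drift(contract, source)
print(drift.has_drift, drift.added, drift.retyped)
```

A real system would likely classify each difference as compatible or breaking before deciding whether to roll the change out or fall back to a backfill; this sketch only reports the differences.
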
How we built SmithDB's inverted index for full-text search (11 minute read)

SmithDB builds inverted indexes with efficient JSON parsing, tokenization, string interning, and radix sorting; interning lifted construction speed by ~2.2x. Streaming compaction bounds memory regardless of index size, while aligned chunks and request coalescing reduce object-storage GETs. Queries merge local-SSD indexes with object-storage segments for sub-second freshness.
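
As a rough illustration of tokenization plus string interning, the Python sketch below maps each distinct token to a small integer ID and builds postings lists from sorted `(token_id, doc_id)` pairs. The `Interner`, `build_index`, and `search` names are hypothetical, this is not SmithDB's code, and Python's built-in sort stands in for the radix sort mentioned above.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


class Interner:
    """Maps each distinct token string to a small integer ID."""

    def __init__(self):
        self.ids = {}
        self.tokens = []

    def intern(self, token: str) -> int:
        token_id = self.ids.get(token)
        if token_id is None:
            token_id = len(self.tokens)
            self.ids[token] = token_id
            self.tokens.append(token)
        return token_id


def build_index(docs: dict):
    """Return (interner, postings); postings maps token ID -> sorted doc IDs."""
    interner = Interner()
    pairs = set()  # (token_id, doc_id), deduplicated
    for doc_id, text in docs.items():
        for token in TOKEN_RE.findall(text.lower()):
            pairs.add((interner.intern(token), doc_id))
    postings = defaultdict(list)
    # Sorting integer pairs here stands in for a radix sort.
    for token_id, doc_id in sorted(pairs):
        postings[token_id].append(doc_id)
    return interner, dict(postings)


def search(interner: Interner, postings: dict, term: str) -> list:
    """Look up a single term and return the matching doc IDs."""
    token_id = interner.ids.get(term.lower())
    return postings.get(token_id, []) if token_id is not None else []


docs = {
    1: "Streaming compaction bounds memory",
    2: "Compaction of index segments",
    3: "Memory bounds",
}
interner, postings = build_index(docs)
print(search(interner, postings, "compaction"))  # [1, 2]
```

Working with integer IDs rather than strings keeps the sort and the postings compact, which is the general reason interning tends to speed up index construction.
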
Turning Scattered Data Into Queryable Segments at Scale: How Razorpay Built Its Customer Data Platform (11 minute read)

Razorpay built an in-house Customer Data Platform that turns scattered transaction data across 500M+ user profiles into real-time, queryable audience segments. It uses Airflow DAGs and Spark for daily segment computation (with reuse and deduplication), Temporal workflows for reliable DynamoDB ingestion with zero-downtime versioning, and privacy-preserving hashed lookups.
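
One common way to implement a privacy-preserving hashed lookup is to key the store on a keyed hash of the identifier rather than the raw value. The Python sketch below assumes HMAC-SHA256 and an in-memory dictionary; the `hash_identifier` and `SegmentStore` names, the normalization rule, and the demo key are hypothetical, and the summary does not say which scheme Razorpay actually uses.

```python
import hashlib
import hmac


def hash_identifier(identifier: str, secret_key: bytes) -> str:
    """Keyed SHA-256 hash of a normalized identifier (hypothetical scheme)."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


class SegmentStore:
    """In-memory stand-in for a key-value store keyed by hashed identifier."""

    def __init__(self, secret_key: bytes):
        self._secret_key = secret_key
        self._segments = {}  # hashed identifier -> set of segment names

    def add(self, identifier: str, segment: str) -> None:
        key = hash_identifier(identifier, self._secret_key)
        self._segments.setdefault(key, set()).add(segment)

    def segments_for(self, identifier: str) -> set:
        key = hash_identifier(identifier, self._secret_key)
        return set(self._segments.get(key, set()))


# Demo key for illustration only; a real key would come from a secret manager.
store = SegmentStore(secret_key=b"demo-key-not-for-production")
store.add("User@Example.com", "high_value")
store.add("user@example.com ", "recent_buyer")
print(sorted(store.segments_for("user@example.com")))  # ['high_value', 'recent_buyer']
```

The store never holds the raw identifier, so a caller can only find a profile's segments by already knowing the identifier and the key.
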
🚀

Opinions & Advice

Why Real Workload Performance is the Metric that Matters (7 minute read)

Real workload performance matters more than headline benchmarks because production systems need to handle real data, concurrency, latency, scale, and cost. Performance claims should be judged by whether the workload matches yours, the setup is production-ready, results hold as data grows, and the product is actually available.

Building My Own Self-Hosted dbt Cloud (6 minute read)

A self-hosted dbt Cloud-style app can deliver much of the developer experience by combining dbt Core with a React/FastAPI interface and Prefect for orchestration. The biggest lesson is to use APIs, not CLI scraping, for reliable job management, logs, deployments, and real-time run status.
💻

Launches & Tools

AI lacks real-world context - and it's costing business trillions annually (Sponsor)

Even the best AI still misses 60% of demand variability - and the problem isn't the models. This PredictHQ guide lays out how Uber and Domino's are grounding AI with real-world context from global events and demand data. Want to see the impact? Try the PredictHQ API and MCP free for 14 days.

Apache Flink 2.3.0 Release Announcement (8 minute read)

Flink 2.3 moves toward a declarative streaming data platform. Materialized tables can evolve through DDL and query changes while avoiding unnecessary historical reprocessing in many common cases. SQL adds changelog conversion, explicit upsert conflict handling, and native S3 support without Hadoop dependencies.

Hardwood 1.0: A Fast, Lightweight Apache Parquet Reader for the JVM (9 minute read)

Hardwood 1.0 is a production-ready, JVM-native Parquet reader for Java 21+ that removes mandatory dependencies and parallelizes page decoding across CPU cores by default. It covers Parquet physical/logical types, projections, predicate push-down, local and object-store files, with row and batch column APIs. Benchmarks show 16.5M rows/sec and ~17-18x selective push-down speedups.

Kafka Share Groups - Pathological Fetch Waits with Record_limit (13 minute read)

A notable performance pitfall in Kafka Share Groups arises when using record_limit with fewer consumers than partitions, especially under partition skew. This leads to pathological fetch waits, which can drastically slow consumption during backlog drains or skewed workloads. The simplest mitigation is to use at least as many consumers as partitions when running with record_limit.
🎁

Miscellaneous

14x faster embeddings: how we rebuilt the ONNX path in Manticore (9 minute read)

Manticore rewrote its embedding pipeline on ONNX Runtime, slashing CPU waste and lifting throughput up to 14x for low-latency vector search. The design shares one thread-safe ONNX session, disables intra-op spinning, and processes documents individually to avoid lock contention and variable-length padding overhead.

We Ran 250 AI Agent Evals to Find Out if Skills Beat Docs. The Answer Is More Complicated Than We Expected (6 minute read)

Wix's 250-run evaluation found agent-optimized docs improved CLI task completion from 67% to 87%, cut token use 35%, and beat skills-only runs when skills were stale or misaligned. For API tasks, both hit 80% completion, but docs ran 31% faster while skills used 29% fewer tokens. Use optimized docs as the foundation, with skills as an evaluated caching layer.

How we used DSPy to turn AI evaluations into better responses in Dash chat (5 minute read)

Dropbox used DSPy to turn AI evaluations into concrete Dash Chat improvements, combining LLM-as-judge evals, human-labeled examples, offline replay, and statistical validation. The result was fewer incomplete answers, better coverage of user intent, and lower token use without compromising answer quality.

Quick Links

Host- and Domain-Level Web Graphs April, May, and June 2026 (3 minute read)

Fresh host/domain graphs offer 6B+ edges for large-scale link analysis without running your own crawler.

Gemma Interactions View (5 minute read)

A coding-agent challenge turned into a collaborative lab, with agents sharing playbooks, pooling quota, debugging each other's work, and stacking small improvements into big performance gains.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

