Google’s Tabular Foundation Model 🧾, Meta’s Data Eng Agent 🛠️, LLM Spark Debugger 🚦

TLDR Data 2026-07-02

📱

Deep Dives

How We Built DEmate: Taming LLMs for Data Engineering at Meta (7 minute read)

Meta's DEmate is an internal LLM-powered assistant for data engineers that helps with writing SQL, generating pipelines, reviewing code, and understanding complex data flows at massive scale. The architecture combines RAG over internal data catalogs, schema documentation, and code repositories with carefully engineered prompts, multi-step reasoning chains, and human-in-the-loop feedback for evaluation and continuous improvement.
Using LLMs to Analyze Spark SQL Plans: A Practical Approach to Debugging Long-Running Jobs (8 minute read)

Instead of manually parsing complex physical plans and DAGs to debug long-running Spark SQL jobs, Expedia feeds the plans, along with relevant context, to LLMs that analyze and explain them. This helps quickly identify bottlenecks, inefficient joins, skewed data, or suboptimal operators, significantly speeding up troubleshooting for production Spark workloads.
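
As a rough illustration of the idea (not Expedia's implementation), the sketch below assembles a prompt from a plan string and some job context. The function name `build_plan_prompt`, the `max_plan_chars` limit, and the prompt wording are all assumptions made for this example; the plan text could come from Spark's `EXPLAIN FORMATTED` output, and the resulting string would be sent to whichever LLM client you use.

```python
from typing import Mapping


def build_plan_prompt(plan_text: str, context: Mapping[str, str],
                      max_plan_chars: int = 12000) -> str:
    """Assemble a prompt asking an LLM to review a Spark physical plan.

    Hypothetical helper: the wording and the truncation limit are
    illustrative choices, not taken from the article.
    """
    plan = plan_text.strip()
    # Very large plans may not fit in a model's context window, so cut them off.
    truncated = len(plan) > max_plan_chars
    if truncated:
        plan = plan[:max_plan_chars]

    # Render job context (runtime, input size, ...) as a sorted bullet list.
    context_lines = [f"- {key}: {value}" for key, value in sorted(context.items())]

    parts = [
        "You are reviewing a Spark SQL physical plan for a long-running job.",
        "Identify likely bottlenecks, inefficient joins, data skew, "
        "and suboptimal operators.",
        "",
        "Job context:",
        *(context_lines or ["- (none provided)"]),
        "",
        "Physical plan:",
        plan,
    ]
    if truncated:
        parts.append("[plan truncated]")
    return "\n".join(parts)


if __name__ == "__main__":
    demo_plan = "== Physical Plan ==\nSortMergeJoin [id], [id], Inner"
    print(build_plan_prompt(demo_plan, {"runtime": "3h", "input_size": "2 TB"}))
```

Keeping the context as explicit key/value pairs makes it easy to see what the model was told when its explanation is reviewed later.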
Ontology Everywhere! (8 minute read)

Ontologies are re-emerging as a practical data platform layer because AI agents need explicit business meaning, not just schemas or dashboards. Unlike data models, ontologies encode shared concepts, typed relationships, constraints, and limited inference. In enterprise tools, this often appears as typed-edge traversal, semantic layers, or knowledge graphs. High-value deployments still require human-curated semantics, especially where systems can write back and act on decisions.
Building Indexes on a Moving Target (20 minute read)

Apache Hudi explores the challenges of building and maintaining indexes on continuously updating datasets (a "moving target"). It covers indexing strategies ranging from simple bloom filters to more advanced approaches, along with the trade-offs among index freshness, query performance, write overhead, and scalability in large-scale data lakes with high-velocity updates.
🚀

Opinions & Advice

Query Faster, Query Smarter: Our Move to DuckDB and What We Learned (4 minute read)

Arcesium migrated thousands of SQL queries from Athena to Trino to DuckDB over 18 months, cutting query costs by ~50% and reducing query runtime by ~50% for small-to-medium workloads. Athena hit account/service limits, while Trino solved scalability but increased resource cost. DuckDB delivered the needed speed with ~40% lower memory footprint. The migration required handling Glue-less schema evolution, Parquet compaction, JSON fallbacks for STRUCT mismatches, and thread parallelism tuning.
Too many tables are bad for you (6 minute read)

Having too many tables in PostgreSQL can seriously hurt performance. The hidden costs range from bloated catalogs and slower query planning to increased I/O. Practical guidance includes consolidating small, related tables, avoiding excessive schema-per-tenant patterns, monitoring catalog size and planning time, and using inheritance or declarative partitioning when appropriate.
Never seen a data quality issue that wasn't actually an ownership problem (4 minute read)

Data quality failures are usually ownership failures: when multiple teams consume the same metric but no single person controls its definition, calculation, and change process, trust erodes and fixes stay temporary. The practical remedy is explicit metric governance: one named owner, clear decision rights, version/change control, and enforceable quality rules tied to the metric.
💻

Launches & Tools

Introducing TabFM: A zero-shot foundation model for tabular data (4 minute read)

TabFM is a foundation model for tabular classification and regression that reframes prediction as in-context learning, removing per-dataset training, hyperparameter tuning, and feature engineering. It was trained on hundreds of millions of synthetic datasets generated with structural causal models and benchmarked on TabArena across 38 classification and 13 regression datasets (700 to 150,000 rows), where it outperformed heavily tuned tree-based baselines.
TiDB (GitHub Repo)

TiDB is an open-source, cloud-native distributed SQL database with strong consistency, MySQL compatibility, horizontal scaling, high availability, and hybrid transactional/analytical processing. It separates compute and storage, supports TiKV and TiFlash, and is positioned for workloads needing transactions, analytics, vector search, and scalable infrastructure.
SedonaDB 0.4: GPU-Accelerated Spatial Joins (3 minute read)

SedonaDB 0.4 adds RayBooster, a GPU spatial join engine that uses NVIDIA ray tracing cores to accelerate geometry intersection queries. It delivers up to ~5.9x faster joins, lower AWS costs, and in some cases lets a consumer RTX 3090 beat an H100 on spatial workloads.
🎁

Miscellaneous

How To Corrupt An SQLite Database File (14 minute read)

SQLite is highly resistant to corruption, but can still be damaged by unsafe file access, bad backups, missing journals, broken locking, failed syncs, faulty storage, memory bugs, or risky PRAGMA settings.
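
One way to sidestep the "bad backups" failure mode is to copy a database through SQLite's online backup API rather than copying the file while it may be in use, and then verify the copy. The sketch below uses Python's standard `sqlite3` module; the helper name `backup_and_check` and the file paths are assumptions for illustration, not something from the linked article.

```python
import sqlite3
from typing import List


def backup_and_check(src_path: str, dest_path: str) -> List[str]:
    """Copy a database with the online backup API, then verify the copy.

    Returns the rows reported by PRAGMA integrity_check on the copy.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup takes a consistent snapshot of the source
        # database, unlike a plain file copy made during a write.
        src.backup(dest)
        rows = dest.execute("PRAGMA integrity_check").fetchall()
    finally:
        dest.close()
        src.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Hypothetical paths; replace with real database files.
    print(backup_and_check("app.db", "app-backup.db"))
```

A healthy copy reports a single `ok` row; any other output lists the problems that `integrity_check` found.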
Data Residency Is Not a Legal Problem. It Is An Infrastructure Design Problem (5 minute read)

Data residency is an infrastructure problem, not a storage-only policy issue: regulated workloads must account for where data is stored, processed, logged, backed up, accessed, and where ML experiments run. Region parity gaps in managed services can force cross-region workarounds or delayed migrations, so teams need region-aware platforms with reproducible CI/CD, RBAC, audit logs, local backups, and portable compute.

Quick Links

No babysitting, not today (9 minute read)

dbt Labs showcases real-world Wizard use cases for dbt Fusion, highlighting advanced patterns that go beyond basic modeling.
Your AI isn't underperforming. Your data foundation is (4 minute read)

TokenMaxxing is the trap of optimizing AI for visible activity and token usage rather than real business outcomes.


Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

