📱

Deep Dives

Are Foundation Models Ready for Your Production Tabular Data? (10 minute read)

Foundation models for tabular data, such as TabPFN, CARTE, TabuLa-8b, and TabDPT, now enable zero-shot and few-shot predictions, outperforming classical methods like XGBoost on small to mid-sized, heterogeneous datasets. These models adopt advanced architectures (e.g., transformer-based attention, graph embeddings, and context-aware learning) and offer familiar scikit-learn-style APIs for seamless integration. Key limitations persist, notably computational demands and scalability, which make them less suitable for large tabular datasets in production environments.

Inside Husky's query engine: Real-time access to 100 trillion events (10 minute read)

Datadog's Husky query engine separates query planning from execution to support high concurrency and flexible routing. The Planner builds logical execution graphs (stages), breaks them into segments, and generates execution plans. The Router matches rules to route segments to appropriate backends. The Executor sends tasks to specialized engines (e.g. SQL engine and custom operators). This modular split allows scalability, pluggable backends, and dynamic optimization per query.
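The planner/router/executor split described above can be sketched in a few lines of Python. This is a toy illustration only, not Datadog's code: the `Segment` type, the `kind` labels, the rule list, and the backend names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Segment:
    """A unit of planned work; `kind` is a hypothetical label used for routing."""
    name: str
    kind: str


def plan(query: dict) -> list[Segment]:
    """Planner: turn a query description into an ordered list of segments."""
    segments = [Segment("scan", "scan")]
    if query.get("filter"):
        segments.append(Segment("filter", "sql"))
    if query.get("aggregate"):
        segments.append(Segment("aggregate", "custom"))
    return segments


# Router: the first matching rule wins; each rule maps a predicate to a backend name.
RULES: list[tuple[Callable[[Segment], bool], str]] = [
    (lambda s: s.kind == "sql", "sql_engine"),
    (lambda s: s.kind == "custom", "custom_operators"),
    (lambda s: True, "default_reader"),  # catch-all rule
]


def route(segment: Segment) -> str:
    for matches, backend in RULES:
        if matches(segment):
            return backend
    raise LookupError(f"no backend for {segment}")


# Executor: dispatch each segment to the backend the router chose.
BACKENDS: dict[str, Callable[[Segment], str]] = {
    "sql_engine": lambda s: f"sql:{s.name}",
    "custom_operators": lambda s: f"custom:{s.name}",
    "default_reader": lambda s: f"read:{s.name}",
}


def execute(query: dict) -> list[str]:
    return [BACKENDS[route(seg)](seg) for seg in plan(query)]
```

Because planning never names a backend directly, adding an engine in this sketch means adding one rule and one entry in `BACKENDS`, which is the kind of pluggability the summary attributes to the modular split.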

Building a Scalable Data Warehouse Backup System with AWS (6 minute read)

Scribd built a backup system for its petabyte-scale S3 data warehouses, focusing on monthly, incremental backups of multiple databases. The new hybrid approach uses AWS Lambda for small workloads and ECS Fargate for larger ones, emphasizing cost efficiency by copying only new or changed Parquet files while always retaining delta logs. Data is validated via S3 Inventory manifests, processed in parallel, and archived in Glacier.
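The selection logic in that summary (copy only new or changed Parquet files, always keep delta logs, pick a runtime by workload size) might look roughly like the sketch below. The key-to-ETag dict shape, the `/_delta_log/` path check, and the 10 GiB threshold are assumptions made for illustration, not details from Scribd's post.

```python
def files_to_copy(current: dict[str, str], previous: dict[str, str]) -> list[str]:
    """Pick the objects to include in this backup run.

    `current` and `previous` map object key -> ETag, as might be read from two
    inventory manifests (the dict shape is an assumption).
    """
    selected = []
    for key, etag in sorted(current.items()):
        is_delta_log = "/_delta_log/" in key       # always retained
        changed = previous.get(key) != etag        # new key or new content
        if is_delta_log or (key.endswith(".parquet") and changed):
            selected.append(key)
    return selected


def choose_runtime(total_bytes: int, threshold_bytes: int = 10 * 1024**3) -> str:
    """Send small jobs to Lambda and large ones to Fargate (threshold is hypothetical)."""
    return "lambda" if total_bytes <= threshold_bytes else "fargate"
```

For example, an unchanged `part-0.parquet` is skipped, a newly added `part-1.parquet` is copied, and the delta log entry is copied regardless of whether it changed.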

🚀

Opinions & Advice

Practical Guide to Semantic Layers: From Definition to Demo (10 minute read)

Semantic layers standardize metric definitions across teams, eliminating inconsistent calculations. The post demonstrates the idea with Boring Semantic Layer on DuckDB/Ibis, showing how metrics and dimensions defined in YAML power consistent queries from Python or Streamlit. The space is heating up, with tools like dbt SL, Malloy, and Snowflake's OSI driving interoperability.
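To make the core idea concrete, here is a minimal sketch of a semantic layer: metrics and dimensions are declared once, and every query is compiled from those names. It does not use Boring Semantic Layer's API. The `MODEL` dict stands in for a YAML file, `sqlite3` stands in for DuckDB, and the table and rows are made up.

```python
import sqlite3

# Hypothetical model definition; in the post this kind of content lives in YAML.
MODEL = {
    "table": "orders",
    "dimensions": {"region": "region"},
    "metrics": {"revenue": "SUM(amount)", "order_count": "COUNT(*)"},
}


def compile_query(model: dict, metrics: list[str], dimensions: list[str]) -> str:
    """Build SQL from named metrics and dimensions; unknown names raise KeyError."""
    dim_cols = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    metric_cols = [f"{model['metrics'][m]} AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(dim_cols + metric_cols)} FROM {model['table']}"
    if dimensions:
        group_by = ", ".join(model["dimensions"][d] for d in dimensions)
        sql += f" GROUP BY {group_by} ORDER BY {group_by}"
    return sql


def run(sql: str) -> list[tuple]:
    """Execute the SQL against a small in-memory table with made-up rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("EU", 10.0), ("EU", 5.0), ("US", 7.5)],
    )
    rows = con.execute(sql).fetchall()
    con.close()
    return rows
```

Calling `run(compile_query(MODEL, ["revenue"], ["region"]))` always computes revenue as `SUM(amount)`, whichever team or tool asks, which is the consistency the article is after.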

The Modern Data Stack's Final Act: Consolidation Masquerading as Unification (14 minute read)

Major vendors are increasingly consolidating the Modern Data Stack by merging layers and acquiring adjacent tools, promoting "unified platforms" that promise simplicity but deliver vendor lock-in and rising switching costs. Despite narratives of integration, most unification is superficial, with true architectural interoperability (shared metadata, governance, and semantics) remaining elusive. Only a few platforms, such as Palantir Foundry and DataOS, offer genuine end-to-end architectural unity. Rigorously assess claims of unification, prioritizing open standards, composable interfaces, and long-term portability to avoid costly enclosure in proprietary ecosystems.

How to Get AI to Deliver Superior ROI, Faster (6 minute read)

Challenges in accelerating AI ROI stem from data silos, QA inefficiencies like garbage tokens and incomplete evaluations, resource overkill from oversized models, and a cultural bias toward "bigger is better." Best practices include lean AI for lower costs, early CFO/stakeholder education, AI-aided debugging, user feedback integration, performance-focused models, and proactive data sourcing to avoid later fixes.

💻

Launches & Tools

Introducing Apache Airflow® 3.1 (8 minute read)

Airflow 3.1 builds on 3.0 with key enhancements for modern data workflows. It adds Human-in-the-Loop (HITL) operators and synchronous DAG execution to better support GenAI/MLOps use cases, introduces a new React-based plugin interface for custom UI extensions, and improves UX with features like favoriting DAGs and language selection. It also supports Python 3.13, shows DAG parsing times, and includes a new trigger rule.

Why You Should Prefer MERGE INTO Over INSERT OVERWRITE in Apache Iceberg (7 minute read)

Prefer MERGE INTO with the Merge-on-Read (MOR) strategy over INSERT OVERWRITE for data updates in Apache Iceberg: it lowers cost, improves performance, and adapts better to evolving partitioning schemes in analytics pipelines. INSERT OVERWRITE rewrites whole partitions, which increases I/O and storage, and it is vulnerable to partition evolution. MERGE INTO (MOR) appends deltas at the file level, avoiding rewrites and supporting metadata-only partition adds, but it requires periodic compaction to manage small-file proliferation.
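The trade-off can be illustrated with a toy model in plain Python. This is not Iceberg's file format or API: `ToyTable`, its one-base-file-per-partition layout, and the `rows_written` counter are hypothetical simplifications that only mimic the write-cost behavior described above.

```python
class ToyTable:
    """Toy model: one base file per partition plus appended delta files."""

    def __init__(self, partitions: dict[str, dict[int, str]]):
        self.base = {p: dict(rows) for p, rows in partitions.items()}
        self.deltas: list[tuple[str, dict[int, str]]] = []
        self.rows_written = 0  # crude stand-in for write I/O

    def insert_overwrite(self, partition: str, changes: dict[int, str]) -> None:
        """Rewrite the whole partition, even if only one row changed."""
        new_rows = {**self.read(partition), **changes}
        self.deltas = [(p, c) for p, c in self.deltas if p != partition]
        self.base[partition] = new_rows
        self.rows_written += len(new_rows)

    def merge_into(self, partition: str, changes: dict[int, str]) -> None:
        """Append a small delta file; the base file is left untouched."""
        self.deltas.append((partition, dict(changes)))
        self.rows_written += len(changes)

    def read(self, partition: str) -> dict[int, str]:
        """Merge-on-read: apply deltas over the base at query time."""
        rows = dict(self.base[partition])
        for p, changes in self.deltas:
            if p == partition:
                rows.update(changes)
        return rows

    def compact(self) -> None:
        """Fold accumulated deltas into the base files (this rewrites them)."""
        for partition in self.base:
            merged = self.read(partition)
            self.base[partition] = merged
            self.rows_written += len(merged)
        self.deltas.clear()
```

Updating one row in a 1,000-row partition costs 1 row written with `merge_into` and 1,000 with `insert_overwrite` in this model. The merge path pays later: every read applies the pending deltas, and `compact` eventually rewrites the partition.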

Postgres Migrations Using Logical Replication (7 minute read)

Migrating large Postgres databases (e.g., terabyte-scale) without downtime is challenging, especially on platforms like RDS that lack WAL access. Traditional tools like pg_dump/pg_restore work for smaller datasets, while WAL-based backups suit physical copies but not logical ones. Logical replication addresses this by streaming changes after an initial sync, though it does not copy schema, indexes, or sequences, so those must be handled separately.
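As a rough outline of such a migration, the helper below assembles the commands in the order they would typically run. The function name, the default publication and subscription names, and the `schema.sql` file are hypothetical, and the strings are built without any quoting or escaping, so treat this as a sketch rather than the article's procedure.

```python
from typing import Optional


def migration_steps(
    source_dsn: str,
    target_dsn: str,
    sequences: Optional[dict[str, int]] = None,
    publication: str = "migration_pub",
    subscription: str = "migration_sub",
) -> list[tuple[str, str]]:
    """Return (where, command) pairs in the order they would be run."""
    steps = [
        # Logical replication does not carry DDL, so copy the schema first.
        ("shell", f'pg_dump --schema-only --dbname="{source_dsn}" > schema.sql'),
        ("shell", f'psql --dbname="{target_dsn}" --file=schema.sql'),
        # The source must run with wal_level = logical for this to work.
        ("source", f"CREATE PUBLICATION {publication} FOR ALL TABLES;"),
        # Creating the subscription starts the initial sync, then streams changes.
        ("target", f"CREATE SUBSCRIPTION {subscription} "
                   f"CONNECTION '{source_dsn}' PUBLICATION {publication};"),
    ]
    # Sequence values are not replicated; set them on the target at cutover.
    for name, value in sorted((sequences or {}).items()):
        steps.append(("target", f"SELECT setval('{name}', {value});"))
    return steps
```

The `sequences` argument is expected to hold values read from the source just before cutover, so that new inserts on the target do not collide with replicated rows.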

🎁

Miscellaneous

F3: The Open-Source Data File Format for the Future (45 minute read)

F3 is a work-in-progress, next-generation open-source columnar format designed for interoperability, extensibility, and efficiency. It embeds WebAssembly decoding logic in each file so any reader (old or new) can decode newer encodings. It decouples I/O unit layout from data row groups, supports flexible dictionary scopes, and uses FlatBuffers for fast metadata access. In evaluations, F3 matches Parquet/ORC performance while enabling seamless evolution. The implementation code is available on GitHub.

Introducing Microsoft Agent Framework: The Open-Source Engine for Agentic AI Apps (13 minute read)

Microsoft has released the open-source Agent Framework in preview, unifying the capabilities of Semantic Kernel and AutoGen to streamline AI agent development for both Python and .NET. The framework enables rapid agent creation with fewer than 20 lines of code, supporting orchestration patterns such as sequential, concurrent, group chat, and handoff with production-grade durability. Integrated with Azure AI and Visual Studio Code, it provides enterprise integrations, MCP connectors, and pluggable memory modules with straightforward YAML/JSON configurations.

⚡

Quick Links

Why Python Data Engineers Should Know Kafka and Flink (3 minute read)

Recent Python API developments make it simple to build streaming pipelines using Kafka and Flink without leaving familiar syntax.

The dbt Fear Index Just Spiked (2 minute read)

Rumors of a Fivetran acquisition sent dbt forks soaring, showing engineers' urge for control over core tooling.

Mooncake Labs joins Databricks to accelerate the vision of Lakebase (2 minute read)

Databricks has acquired Mooncake Labs to power Lakebase, an OLTP database built on Postgres and seamlessly integrated into the Databricks platform.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

Airflow 3.1 Released 🚀, F3 Redefines Columnar Formats 🔄, Transformers Meet Tabular Data 🧠

TLDR Data 2025-10-06
