dbt Core v2 Alpha 🦀, Cart Prediction with LLMs 🛒, Ray vs Daft 🧪

TLDR

Together With Fivetran

TLDR Data 2026-06-04

It's official: Fivetran + dbt Labs merge to build the data foundation for trustworthy AI agents (Sponsor)

Only 1 in 6 orgs have the data foundation for agentic AI. Fivetran and dbt Labs will now join forces to redefine the future of open data infrastructure, enabling organizations to accelerate their AI initiatives with a robust data foundation.

Alongside the merger, Fivetran has revealed the first combined innovations from Fivetran + dbt:

  • dbt Core v2.0: open-sources the dbt Fusion engine runtime
  • dbt State (preview): caching layer for data pipelines that reduces underlying infra costs by more than 30%
  • dbt Wizard (beta): autonomous support for model authoring, refactoring, and debugging
  • Agents Schema: open source standard for agentic context

📱

Deep Dives

Your Cart Has a Story. Here's How We Learned to Read It (7 minute read)

Zepto built a Cart Contextual Model that treats shopping carts as “sentences” and uses a Transformer-based masked language model (MLM) to infer user intent in real time as items are added. By training on historical cart patterns with temporal, geographical, and product signals plus inverse-frequency masking to handle long-tail items, the model predicts what else the user will likely buy.
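The inverse-frequency masking idea can be illustrated with a small sketch. This is not Zepto's implementation: the helper names (`masking_weights`, `mask_one`), the toy carts, and the exact 1/count weighting are all assumptions made for illustration. The point is only that rare (long-tail) items get masked more often than popular ones, so the model is asked to predict them more frequently during training.

```python
import random
from collections import Counter

MASK = "[MASK]"

def masking_weights(carts):
    """Weight each item by the inverse of its frequency across all carts (assumed scheme)."""
    counts = Counter(item for cart in carts for item in cart)
    return {item: 1.0 / n for item, n in counts.items()}

def mask_one(cart, weights, rng):
    """Mask one position, chosen with probability proportional to its item's weight."""
    idx = rng.choices(range(len(cart)), weights=[weights[i] for i in cart], k=1)[0]
    masked = list(cart)
    target = masked[idx]  # the item the model would be trained to recover
    masked[idx] = MASK
    return masked, target

# Toy historical carts: "milk" is very common, "saffron" is a long-tail item.
carts = [
    ["milk", "bread", "eggs"],
    ["milk", "bread", "saffron"],
    ["milk", "eggs", "butter"],
    ["milk", "bread", "butter"],
]

weights = masking_weights(carts)
rng = random.Random(0)

# Repeatedly mask the second cart and count which item was hidden.
picks = Counter(mask_one(carts[1], weights, rng)[1] for _ in range(2000))
masked_cart, target = mask_one(carts[1], weights, rng)
```

In this toy setup the rare item is selected as the masking target far more often than the ubiquitous one, which is the effect the summary attributes to inverse-frequency masking.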

A field journal on Ray Data and Daft for multimodal data lake (14 minute read)

After running 8 production-like use cases side by side, the author chose Ray Data over Daft, primarily for its stability and resilience at scale (especially in complex async LLM inference), while acknowledging Daft's ergonomic native multimodal primitives and cleaner code for many operations.

Vector Search in Manticore Search: A Deep Dive (28 minute read)

Manticore Search argues vector search should be tuned like a production retrieval system, not treated as a default embedding feature. It recommends aligning similarity metrics with models, tuning HNSW for recall, latency, and memory, and using batching, chunk optimization, and physical backups to keep indexes consistent.
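Why the similarity metric has to match the embedding model can be shown with a generic sketch. It does not use Manticore's API; the vectors, document names, and helper functions are hypothetical. It shows that inner product and cosine similarity can rank the same documents differently when vectors are not unit-length, and agree once they are normalized.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: inner product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def rank(query, docs, score):
    """Return document names ordered from most to least similar."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

query = [1.0, 0.0]
docs = {
    "short_aligned": [0.9, 0.1],   # points almost exactly along the query
    "long_off_axis": [3.0, 3.0],   # 45 degrees off, but much larger magnitude
}

by_cosine = rank(query, docs, cosine)
by_dot = rank(query, docs, dot)

# After normalizing the documents, inner product reproduces the cosine ranking.
unit_docs = {name: normalize(v) for name, v in docs.items()}
by_dot_unit = rank(query, unit_docs, dot)
```

Here cosine prefers the well-aligned vector while raw inner product prefers the longer one, so choosing a metric the model was not trained for can silently change which results come back first.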

🚀

Opinions & Advice

The Rise of Multi-Query Engines (7 minute read)

AI agents generate more small, bursty queries, which makes costs harder to manage on a single warehouse. Multi-engine routing cuts cost by sending each query to the best-suited engine while keeping familiar workflows.

Debunking 8 data layout myths: why Liquid Clustering outperforms partitioning (11 minute read)

Databricks debunks 8 common myths about data layout, arguing that Liquid Clustering is superior to traditional Hive-style partitioning for modern lakehouses. Unlike rigid partitioning, Liquid Clustering dynamically organizes data using clustering keys that can evolve over time, supports row-level concurrency and metadata-only operations, and works across open table formats.

💻

Launches & Tools

dbt Core v2 is here: still open source, now rebuilt for what's next (9 minute read)

dbt Core v2.0 alpha makes the Fusion engine's Rust-based runtime open source under Apache 2.0, unifying Core and Fusion around a shared foundation with faster parsing, Parquet artifacts, better local docs, simpler installs, and a tighter language spec. Fusion remains the recommended free CLI for most users, while Core v2 serves teams that need fully open source code or custom OSS builds.

ingestr (GitHub Repo)

ingestr is a CLI ELT tool for moving data from many databases and SaaS apps into warehouses or storage with simple flags, no backend or custom code required. It supports incremental loads, easy install, and broad connector coverage.

Diving deep into Redis's new array data type (25 minute read)

Redis Array is a new native data type (introduced in Redis 8.8) designed for constant-time positional access by index. It fills a long-standing gap in Redis for data where the position itself carries meaning. A hierarchical, group-based structure lets it handle both dense and extremely sparse arrays efficiently, supporting fast random access, range queries, ring-buffer semantics, pattern matching across sparse data, and fixed memory usage.

Routing Multiple Query Engines with Iceberg (18 minute read)

QueryFlux is an open-source, Rust-based SQL routing proxy that directs queries across multiple query engines (Trino, Spark, DuckDB, Snowflake, Athena, Flink, etc.) that share the same Iceberg tables. It handles protocol translation, dialect conversion via SQLGlot, cost-aware routing, concurrency control, and health-based failover.
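Cost-aware routing with health-based failover can be sketched in a few lines. This is not QueryFlux's code or configuration (QueryFlux itself is written in Rust): the `Engine` fields, the cost numbers, the scan-size limits, and the `route` function are all hypothetical, chosen only to show the decision a router of this kind might make.

```python
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    cost_per_gb: float   # hypothetical relative cost of scanning 1 GB
    max_scan_gb: float   # hypothetical largest scan this engine should accept
    healthy: bool = True

def route(scan_gb, engines):
    """Pick the cheapest healthy engine that can handle the estimated scan size."""
    candidates = [e for e in engines if e.healthy and scan_gb <= e.max_scan_gb]
    if not candidates:
        raise RuntimeError("no healthy engine can serve this query")
    return min(candidates, key=lambda e: e.cost_per_gb * scan_gb)

engines = [
    Engine("duckdb", cost_per_gb=0.1, max_scan_gb=50),
    Engine("trino", cost_per_gb=0.5, max_scan_gb=5_000),
    Engine("spark", cost_per_gb=0.8, max_scan_gb=100_000),
]

small = route(2, engines).name       # small scan fits the cheapest engine
large = route(800, engines).name     # too big for duckdb, so the next cheapest wins

# Health-based failover: mark one engine unhealthy and route the same query again.
engines[1].healthy = False
failover = route(800, engines).name
```

A real router would also need an estimate of scan size per query and live health checks; both are taken as given inputs here.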

🎁

Miscellaneous

MongoDB and Stored Procedures (10 minute read)

MongoDB can run low-latency transactional logic without stored procedures by combining ACID transactions, bulkWrite, validation, indexes, and pipeline updates. This is demonstrated through an example that processes payments with card checks, vendor checks, limits, duplicate prevention, and ledger writes.

OpenTelemetry Launches “Blueprints” Initiative to Simplify Enterprise Observability Adoption (3 minute read)

OpenTelemetry launched “Blueprints” to simplify observability with standard patterns and reference implementations for Kubernetes, infrastructure, apps, and centralized telemetry platforms.

Quick Links

Authorization for AI agents: What to build before the EU AI Act deadline (6 minute read)

Lays out identity, policy, and audit patterns teams need to externally enforce least-privilege agent calls under upcoming regulation.

dltHub AI Workbench data quality toolkit: schema-aware checks that route their own fixes (4 minute read)

Preview adds persistent, metadata-driven data-quality decorators that fail fast and auto-route remediation in dlt pipelines.

Pluto 1.0 Release (12 minute read)

Pluto 1.0 marks the Julia notebook environment as stable, with major improvements to reproducibility, reactivity, sharing, accessibility, education, docs, and editor tools.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

