We Let AI Agents Orchestrate Our ML Experiments (5 minute read)
Teads built a multi-agent system to autonomously orchestrate its entire ML experimentation lifecycle. Specialized agents handle idea generation, code writing, experiment execution, result analysis, and decision-making, reducing experiment cycles from days to hours, increasing the number of meaningful experiments by 4.5x, and improving production model performance by 8–12%.
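The lifecycle described above can be sketched as a chain of specialized stages. Everything here (the Experiment fields, the stand-in metric, the 0.40 baseline) is hypothetical for illustration, not Teads' implementation:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    idea: str
    code: str = ""
    metric: float = 0.0
    decision: str = ""

def ideate():
    # Stand-in for an idea-generation agent.
    return Experiment(idea="try a larger embedding dimension")

def write_code(exp):
    # Stand-in for a code-writing agent.
    exp.code = f"# training script for: {exp.idea}"
    return exp

def run(exp):
    # Stand-in for an execution agent; a real one launches a training job.
    exp.metric = 0.42
    return exp

def decide(exp, baseline=0.40):
    # Analysis + decision agents: promote only if the metric beats baseline.
    exp.decision = "promote" if exp.metric > baseline else "discard"
    return exp

def orchestrate(n_experiments=3):
    """Push each experiment through the specialized stages in order:
    idea generation -> code writing -> execution -> analysis/decision."""
    return [decide(run(write_code(ideate()))) for _ in range(n_experiments)]

experiments = orchestrate(2)
```

The point of the pattern is that each stage is a separate, swappable agent, so the loop itself stays trivial.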
Scaling Recommendation Systems with Request-Level Deduplication (9 minute read)
Pinterest Engineering introduced request-level deduplication to scale its recommendation systems more efficiently. By sorting data by user + request ID in Apache Iceberg, it achieves massive compression and processes and stores request-level data only once per unique request; ranking uses a separated context transformer with KV caching, and training applies targeted fixes like SyncBatchNorm and user-level masking.
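The dedup idea can be illustrated in a few lines. The record fields used here (user_id, request_id, context, item) are assumed for illustration, not Pinterest's actual schema:

```python
def deduplicate_by_request(records):
    """Collapse per-item logging rows so the shared request context
    (user features, request metadata) is stored once per unique
    (user_id, request_id) key."""
    # Sorting by the dedup key clusters identical contexts together,
    # which is also what makes columnar compression effective in a
    # table format like Iceberg.
    ordered = sorted(records, key=lambda r: (r["user_id"], r["request_id"]))
    requests = {}
    for r in ordered:
        key = (r["user_id"], r["request_id"])
        if key not in requests:
            requests[key] = {"context": r["context"], "items": []}
        requests[key]["items"].append(r["item"])
    return requests

rows = [
    {"user_id": "u1", "request_id": 7, "context": {"country": "US"}, "item": "pin_a"},
    {"user_id": "u1", "request_id": 7, "context": {"country": "US"}, "item": "pin_b"},
    {"user_id": "u2", "request_id": 9, "context": {"country": "FR"}, "item": "pin_c"},
]
deduped = deduplicate_by_request(rows)
# The shared context for (u1, 7) is now stored once instead of twice.
```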
Zero Downtime Upgrade: Yelp's Cassandra 4.x Upgrade Story (8 minute read)
Yelp upgraded over 1,000 Cassandra nodes from version 3.11 to 4.1 across multiple clusters with zero downtime, using a careful rolling upgrade strategy built on Kubernetes init containers, automated pre-flight/flight/post-flight stages, version-specific images, and strict monitoring during the mixed-version period. The upgrade delivered 21–60% latency improvements, faster streaming, better observability, new guardrails, and preparation for Cassandra 5.
How To Set-up Your Data Stack For 2026 – Data Infrastructure For AI (8 minute read)
Building a successful AI-ready data infrastructure starts with simplicity and strong fundamentals rather than chasing the latest AI hype: focus on solid ingestion tools, SQL-based transformations (such as dbt), the right storage/compute layer (warehouse or lakehouse), and strong data quality, governance, and ownership.
Stop Treating AI Memory Like a Search Problem (22 minute read)
Reliable AI memory needs more than store-and-retrieve approaches: it must manage decay, contradiction, confidence, compression, and expiry. The proposed SQLite-based design keeps plain-text memories locally, then scores them based on importance, confidence, and decay, so outdated or weakly supported facts stop dominating retrieval. New memories can supersede older ones, expired items fade into an archive, and duplicate beliefs are merged into higher-signal summaries.
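A minimal sketch of the scoring idea: importance and confidence boost a memory while an exponential time decay lets stale facts sink in retrieval. The 30-day half-life is an assumed tuning knob, not a figure from the article:

```python
import time

def memory_score(importance, confidence, last_accessed_ts,
                 now=None, half_life_days=30.0):
    """Rank a stored memory. importance and confidence are in [0, 1];
    the score halves every half_life_days since last access, so
    outdated or weakly supported facts stop dominating retrieval."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - last_accessed_ts) / 86400.0)
    decay = 0.5 ** (age_days / half_life_days)
    return importance * confidence * decay

now = time.time()
fresh = memory_score(0.9, 0.8, now, now=now)               # no decay yet
stale = memory_score(0.9, 0.8, now - 60 * 86400, now=now)  # 60 days old
# The stale memory has passed through two half-lives, so it scores
# a quarter of the fresh one.
```

In the SQLite design described above, a score like this would be computed at query time so that archiving and superseding decisions can reuse the same ranking.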
Power BI and Support for Third-Party Semantic Models (6 minute read)
Power BI doesn't properly support third-party semantic models mainly due to technical limitations around query behavior, aggregation, and architecture, not competitive intent. As a result, Microsoft recommends keeping all metrics and business logic within Power BI's own semantic model for reliability and performance.
Introducing the Common AI Provider: LLM and AI Agent Support for Apache Airflow (5 minute read)
Apache Airflow's new apache-airflow-providers-common-ai package adds native LLM and agent support with 6 operators and 20+ model providers, requiring Airflow 3.0+. It includes structured tasks like @task.llm, @task.agent, and @task.llm_sql, plus file analysis, branching, schema comparison, and direct access to 350+ existing Airflow hooks as typed AI tools. It also features built-in human approval flows, durable execution with step-level replay from object storage, and end-to-end token/tool observability.
KumoRFM-2: The Most Powerful Predictive Model, for Humans and Agents (6 minute read)
KumoRFM-2 is Kumo's relational foundation model for predictions that can reason directly on database tables, keys, and time history, without the usual feature-engineering pipeline. Kumo claims it beats supervised ML on common relational benchmarks in few-shot settings, pointing to a simpler way to turn warehouse data into predictive and agent-ready applications.
Managing context in long-run agentic applications (14 minute read)
Long-running agents quickly hit context window limits and suffer from "context rot" (losing important earlier information). Slack uses intelligent context pruning and summarization strategies, with periodic "reflection" steps where the agent reviews and condenses its own history, improving agent reliability and coherence over long time horizons.
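A minimal sketch of the prune-and-summarize step, not Slack's implementation; summarize() is a placeholder for what would be an LLM call in a real agent, and the thresholds are assumed:

```python
def summarize(messages):
    # Placeholder: a real implementation would ask the model to condense
    # the old turns into a short natural-language summary.
    return f"{len(messages)} earlier messages condensed"

def reflect(history, keep_recent=4, max_len=10):
    """When the message history exceeds max_len, condense everything
    except the most recent turns into a single summary message,
    imitating a periodic 'reflection' step."""
    if len(history) <= max_len:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"role": "system",
               "content": f"Summary of earlier work: {summarize(old)}"}
    return [summary] + recent

history = [{"role": "user", "content": f"step {i}"} for i in range(12)]
compact = reflect(history)
# 12 messages -> 1 summary + 4 recent = 5 messages, back under the cap,
# while the condensed summary guards against context rot.
```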
Scaling Prometheus in 2026: The Complete Comparison Guide (4 minute read)
Prometheus-compatible long-term storage has matured into clear options: VictoriaMetrics for most teams needing 4–5x less RAM and a low operational burden, Thanos for the lowest-friction migration from existing Prometheus, OpenObserve for full-stack observability at lower cost, GreptimeDB for unified SQL-first metrics/logs/traces, and Mimir for large enterprises with 500+ developers and dedicated SREs. The key decision factor is not just infrastructure cost but the ongoing "Ops Tax" of running each system.
OpenDuck (GitHub Repo)
OpenDuck is an open-source system that brings MotherDuck-style cloud capabilities to DuckDB, enabling hybrid queries that run across local and remote data with transparent access.