TLDR

TLDR Data 2026-06-15

📱

Deep Dives

Semantic Search for AI Agents at Scale: Retrieval and Ranking for LinkedIn's Hiring Assistant (15 minute read)

LinkedIn built MUSE (Member Understanding Semantic Embeddings) to power semantic search inside Hiring Assistant. It uses a dual-tower Matryoshka embedding model trained on millions of high-quality labels from an LLM Teacher grounded in product policy, combining embedding-based retrieval with a downstream engagement-optimized ranker.
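For readers new to Matryoshka embeddings: they are trained so that a prefix of the vector is still a usable embedding. That makes a cheap first pass over short prefixes possible before rescoring with the full vector. The sketch below illustrates that general idea only. The two-stage `retrieve` function, the toy 4-dimensional vectors, and the member names are all assumptions for illustration, not details of how MUSE is built.

```python
import math

def truncate(vec, dims):
    """Keep the first `dims` coordinates and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a, b):
    # Inputs are unit-length, so the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, candidates, coarse_dims, shortlist, top_k):
    """Two-stage search: cheap prefix scan, then full-vector rescoring."""
    q_short = truncate(query, coarse_dims)
    coarse = sorted(
        candidates,
        key=lambda name: cosine(q_short, truncate(candidates[name], coarse_dims)),
        reverse=True,
    )[:shortlist]
    q_full = truncate(query, len(query))
    return sorted(
        coarse,
        key=lambda name: cosine(q_full, truncate(candidates[name], len(query))),
        reverse=True,
    )[:top_k]

# Hypothetical toy data; real models use hundreds of dimensions.
MEMBERS = {
    "member_a": [0.9, 0.1, 0.3, 0.1],
    "member_b": [0.1, 0.9, 0.1, 0.3],
    "member_c": [0.8, 0.2, -0.4, 0.2],
}
QUERY = [1.0, 0.0, 0.4, 0.0]
RESULT = retrieve(QUERY, MEMBERS, coarse_dims=2, shortlist=2, top_k=1)
```

The 2-dimensional pass shortlists `member_a` and `member_c`; the full-vector pass then separates them using the dimensions the prefix ignored.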

Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant (6 minute read)

Spotify built Vedder, an AI data assistant for 2,100+ users across 177 clusters, to move beyond schema-only RAG across 70,000 datasets. Domain experts curate each cluster with datasets, vetted question-SQL pairs, and business docs. Only 12.5% of mined query pairs were accepted, so health scoring tracks drift, validity, coverage, and reproducibility to keep context reliable.

How Feldera Works: A True Incremental View Maintenance Engine (3 minute read)

Feldera treats streams as incremental SQL views, using DBSP to propagate deltas instead of recomputing joins and aggregations. Inserts, deletes, and updates become Z-set changes, so only affected rows are updated. The result is batch-SQL-like semantics for continuous pipelines with lower CPU, less memory pressure, and predictable latency.
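A Z-set can be pictured as a map from rows to integer weights: an insert carries +1, a delete carries -1, and an update is a delete plus an insert. The sketch below maintains a `COUNT(*) ... GROUP BY` view from deltas alone. It is a toy illustration of the idea, not Feldera's or DBSP's actual API; the function names and the order rows are made up.

```python
from collections import Counter

def apply_delta(zset, delta):
    """Add weights from `delta` into `zset`, dropping rows whose weight hits zero."""
    for row, weight in delta.items():
        zset[row] += weight
        if zset[row] == 0:
            del zset[row]

def count_by_key_delta(delta):
    """Turn a delta of (key, value) rows into a delta of per-key counts."""
    out = Counter()
    for (key, _value), weight in delta.items():
        out[key] += weight
    return out

table = Counter()   # base table as a Z-set: row -> weight
counts = Counter()  # maintained view: key -> COUNT(*)

def step(delta):
    """Process one batch of changes without rescanning `table`."""
    apply_delta(table, delta)
    apply_delta(counts, count_by_key_delta(delta))

step({("eu", "order-1"): +1, ("eu", "order-2"): +1, ("us", "order-3"): +1})
step({("eu", "order-1"): -1})                         # delete
step({("us", "order-3"): -1, ("eu", "order-3"): +1})  # update = delete + insert
```

Each `step` touches only the keys named in the delta, which is where the CPU and memory savings over recomputation come from.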

🚀

Opinions & Advice

A frontier without an ecosystem is not stable (4 minute read)

Companies need to compound human expertise and AI capability, not just rely on the best model. By owning their workflows, evals, and institutional knowledge, firms can keep improving while avoiding a future where all value flows to a few frontier models.

The Mythical Agent-Month (10 minute read)

AI coding agents reduce coding labor, but not the hardest parts of software: design judgment, scope control, testing, and maintainability. They reduce accidental complexity, but can create technical debt, architectural drift, and bloated codebases at machine speed. The edge shifts to experts who can steer the model, say no, and keep systems production-ready.

The Bill Arrives: How to Manage Agentic AI Costs at Scale (17 minute read)

Uber's AI budget blowout shows agentic AI is a task-economics problem, not a token-pricing problem. Claude Code adoption hit 84% across 5,000 engineers, exhausting the annual AI budget by mid-April. With spend hidden in re-sent context, retrieval, orchestration, governance, and retries, teams need to measure value per task, control context, and build stateful agent infrastructure.

💻

Launches & Tools

Join renowned data strategist Doug Laney and Matia CEO Benjamin Segal for a discussion on the future of the data stack. (Sponsor)

Your data stack is held together with duct tape. You know it. Your team knows it. On June 24th, Matia CEO Benjamin Segal and Doug Laney, author of Infonomics and Data Juice, are doing a live fireside on what comes next.

Register now→

Introducing Omnigent: A Meta-Harness to Combine, Control, and Share Your Agents (7 minute read)

Omnigent is an open-source Databricks meta-harness that makes agents like Claude Code, Codex, Pi, and custom agents work together through one shared layer. It helps teams compose agents, add security and cost controls, share live sessions, and keep workflows portable as tools change.

Introducing Flights: Agent-Native Ingest in MotherDuck (4 minute read)

Flights is MotherDuck's new agent-native data pipeline feature that lets AI agents easily build, run, and schedule ingestion and transformation workloads using a secure, general-purpose Python runtime. It has native support for dlt pipelines, direct DuckDB execution, logging, scheduling, and versioning. Agents can create Flights via the MCP server, SQL table functions, or the UI.

Apache DataFusion 54.0.0 Released (7 minute read)

Apache DataFusion 54.0.0 adds major SQL upgrades, including LATERAL joins, SQL lambda functions for arrays, a new Arrow-based Avro reader, and spill-to-disk for memory-heavy nested loop joins. Performance also jumps, with near-unique LEFT/FULL sort-merge joins up to 20–50× faster and repartition-heavy operations improving by up to 50%.

🎁

Miscellaneous

The Hidden Cost of ai_parse_document in Production (10 minute read)

Databricks' ai_parse_document + ai_query can turn messy PDFs into structured JSON in a few SQL lines, but the challenge is reliability at scale. Every rerun incurs parsing and LLM costs again, corrected documents can create duplicates, and even at temperature 0 the outputs are non-deterministic, which undermines auditability. A pipeline design with checkpoints, versioned prompts, and deduplication reduces reprocessing cost and improves reproducibility. Deterministic parsers like OpenDataLoader PDF are more appropriate when document templates are consistent.
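One way to picture the checkpoint-plus-versioned-prompt idea is a cache keyed by the document's content hash and a prompt version label, so that an unchanged document is never parsed twice under the same prompt. This is a minimal sketch under assumptions: `extract_once`, `PROMPT_VERSION`, and `fake_parser` are hypothetical names, an in-memory dict stands in for a durable checkpoint table, and the parser is a stub, not a call to ai_parse_document.

```python
import hashlib
import json

PROMPT_VERSION = "v1"  # hypothetical label; bump it whenever the prompt changes

def cache_key(document_bytes, prompt_version):
    """Identify a result by exact document content plus prompt version."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"{prompt_version}:{digest}"

def extract_once(document_bytes, checkpoint, parse_fn, prompt_version=PROMPT_VERSION):
    """Return the stored result if present; otherwise parse once and store it."""
    key = cache_key(document_bytes, prompt_version)
    if key not in checkpoint:
        checkpoint[key] = json.dumps(parse_fn(document_bytes), sort_keys=True)
    return json.loads(checkpoint[key])

CALLS = []

def fake_parser(document_bytes):
    """Stub standing in for the expensive parse + LLM step."""
    CALLS.append(1)
    return {"length": len(document_bytes)}

STORE = {}  # stand-in for a durable checkpoint table
first = extract_once(b"invoice 001", STORE, fake_parser)
again = extract_once(b"invoice 001", STORE, fake_parser)  # served from STORE
```

Storing the first output also pins it for audits, since a rerun returns the saved JSON instead of a fresh, possibly different, model response. A content hash only catches exact re-runs; a corrected document has different bytes, so deduplicating it would additionally need a stable document identifier.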

Linux Foundation Announces OpenSharing Project to Standardize AI Asset and Data Exchange (4 minute read)

Databricks has handed the Delta Sharing protocol over to the Linux Foundation. OpenSharing extends Delta Sharing to AI models, agent skills, and unstructured data across clouds and platforms. It adds standard APIs for discovery, authorization, and access, with support for existing Delta Sharing recipients plus Apache Iceberg/REST Catalog clients. The project aims to replace proprietary marketplaces with a single standard for enterprise AI asset distribution.

⚡

Quick Links

New framework for auditing machine unlearning (6 minute read)

Google Research introduced Regularized f-Divergence Kernel Tests to audit machine unlearning and privacy leakage more reliably than standard two-sample tests.

Feature Stores from Scratch: A Minimal Working Implementation (5 minute read)

DuckDB + Redis deliver a five-component DIY feature store that avoids training-serving skew for real-time ML and RAG systems.

SQL to ER Diagram (Tool)

SQL to ER Diagram is a free, open source, browser-only tool that turns pasted SQL schemas into clean, interactive ER diagrams without uploading your data or requiring signup.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud



Databricks’ Agent Orchestrator 🕹️, Ecosystems Beat Models 🔁, LinkedIn’s Search Brain 🔍
