Reducing Runtime Errors in Spark: Why We Migrated from DataFrame to Dataset (8 minute read)
Migrating from DataFrame to Dataset in Apache Spark significantly reduces runtime errors by leveraging Scala's type safety to catch issues at compile time. DataFrames are untyped, so problems like column mismatches or unexpected nulls only surface during execution and disrupt pipelines. Datasets, with their strongly typed structure, catch these errors early, improving debugging and reliability in complex data workflows. The shift enabled Agoda's team to streamline development, boost pipeline stability, and accelerate the delivery of data-driven features.

An LLM-Based Workflow for Automated Tabular Data Validation (6 minute read)
Automated cleaning of tabular datasets can be structured around a framework focused on data validity: expected formats, types, and distributions. The process defines data expectations through semantic and statistical analysis and uses large language models (LLMs) to detect issues and suggest corrections. Effective validation hinges on iterative rule-based checks with libraries like Pandera and Great Expectations, while a human-in-the-loop approach improves transparency and reliability (a hedged Pandera sketch appears below). An implementation of the presented method is available.

Tinder's migration to Elasticsearch 8 (10 minute read)
Tinder's ES6-to-ES8 migration proceeded in three stages. Stage 1 established data consistency through dual writes to both the ES6 and ES8 clusters (a dual-write sketch appears below). Stage 2 covered reindexing data into ES8, rigorous validation checks, performance benchmarking, and testing of critical functionality such as search relevance and recommendations. Stage 3 deployed shadow traffic for real-time evaluation, gradually shifted live traffic while monitoring key metrics, and completed the cutover and decommissioning of the old cluster.

What "Shifting Left" Means and Why it Matters for Data Stacks (22 minute read)
Shifting left in data engineering means moving data quality checks, business logic, and governance processes earlier in the data lifecycle, closer to the data source. The goal is to detect and resolve data issues promptly, reducing downstream errors and producing more maintainable data systems. By adopting declarative tools like SQLMesh, organizations can centralize logic, streamline development workflows, and keep metric definitions consistent across the stack. The approach improves data quality and system performance while aligning data practices more closely with software engineering principles.

Fully Local Data Transformation with dbt and DuckDB (7 minute read)
Engineers can run data transformations entirely locally using DuckDB and dbt via the dbt-duckdb adapter. A persisted DuckDB database combined with external GeoJSON files processes 400 MB of data in 40-45 seconds on a MacBook Pro, and the adapter's external materialization feature supports exporting to CSV, JSON, and Parquet files with full-refresh capability. DuckDB also integrates smoothly with a local PostgreSQL instance, allowing PostgreSQL data to be queried alongside DuckDB's efficient in-process engine (a DuckDB sketch appears below).

Ecosystem Considerations for Data Science (14 minute read)
Organizations often rush to adopt AI as a quick fix for business challenges without integrating it into a comprehensive data ecosystem framework. Successful ML/AI implementations hinge not on standalone technology but on a holistic data strategy encompassing quality data, robust infrastructure, algorithms, and human expertise. Key considerations include aligning AI projects with business strategy, ensuring data readiness, and fostering cross-departmental collaboration. Prioritizing the full data ecosystem rather than isolated AI ventures reduces failure rates and increases business impact.

Simplifying Data Pipelines with Durable Execution (40 minute podcast)
Durable execution enables exactly-once processing in distributed systems without external components like queues or CI pipelines. In this podcast, Jeremy Edberg, CEO of DBOS, discusses how the company's Transact library provides local resilience and streamlined orchestration through a serverless architecture, and how version control for long-running workflows ensures updates don't disrupt in-flight executions. The approach is especially effective for complex pipelines and AI-driven systems that require reliability and maintainability (a conceptual checkpointing sketch appears below).

Powering Semantic SQL for AI Agents with Apache DataFusion (12 minute read)
Wren Engine is a semantic SQL engine powered by Apache DataFusion, designed to improve AI agents' interaction with enterprise data. By integrating a semantic layer through its Modeling Definition Language (MDL), Wren Engine lets agents understand complex data relationships and business logic, enabling accurate and efficient SQL generation across diverse databases. This addresses challenges in SQL dialect compatibility and schema understanding, allowing AI agents to deliver reliable, context-aware insights.

xorq: Multi-engine ML pipelines made simple (GitHub Repo)
Built on Ibis and DataFusion, xorq is a deferred computation framework that brings the replicability and efficiency of declarative pipelines to Python's machine learning ecosystem. It lets users write pandas-like transformations that are memory-efficient, automatically cache intermediate outputs, and switch between SQL engines and Python user-defined functions (UDFs) while ensuring consistent, reproducible results.

Ringing Out the Old: AI's Role in Redefining Data Teams, Tools, and Business Models (53 minute podcast)
AI is forcing a major reset in how data teams operate, moving from tool-centric to problem-centric thinking. Teams must prioritize adaptability and focus on how AI can directly affect business outcomes rather than just optimizing data stacks. AI will compress the complexity of data tooling, making many low-level engineering tasks obsolete. Ultimately, success will hinge on how well teams rethink their roles and workflows in light of AI's rapid integration.

A roadmap to scaling Postgres (7 minute read)
This tiered roadmap for data engineers breaks Postgres scaling into stages: first mastering MVCC internals and optimizing data models, then strategic indexing, partitioning, and caching, and only later sharding, hybrid storage engines, and derivative systems. It stresses cost-effective hardware or managed DBaaS choices early on and reserves complex architectures for last.

FastAPI-MCP (GitHub Repo)
A zero-configuration tool for automatically exposing FastAPI endpoints as Model Context Protocol (MCP) tools (a hedged quick-start sketch appears below).
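For the LLM-based tabular validation item, here is a minimal Pandera sketch of the kind of rule-based validity check the workflow relies on; the column names, types, and checks are illustrative assumptions, not the article's actual expectations.

```python
# Minimal Pandera sketch: declare expected types, formats, and ranges for a
# tabular dataset, then validate. Column names and checks are illustrative
# assumptions, not the article's actual rules.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, pa.Check.ge(1), nullable=False),
        "country": pa.Column(str, pa.Check.isin(["US", "GB", "DE"])),
        "signup_date": pa.Column(pa.DateTime, nullable=False),
        "spend": pa.Column(float, pa.Check.in_range(0, 1_000_000)),
    },
    strict=True,  # reject unexpected columns
)

df = pd.DataFrame(
    {
        "user_id": [1, 2],
        "country": ["US", "DE"],
        "signup_date": pd.to_datetime(["2024-01-05", "2024-02-11"]),
        "spend": [12.5, 80.0],
    }
)

# With lazy=True, all violations are collected and reported together
# instead of raising on the first failure.
validated = schema.validate(df, lazy=True)
print(validated.head())
```

Checks like these can be generated or refined iteratively (for example, from LLM-suggested expectations) and reviewed by a human before being enforced.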
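For Tinder's migration, Stage 1's dual-write idea can be sketched as a thin wrapper around two writer functions; the names and document shape are hypothetical stand-ins for the version-specific Elasticsearch clients, not Tinder's implementation.

```python
# Hedged sketch of a dual-write wrapper: every write goes to the primary
# (ES6) path as before and is mirrored to the new ES8 cluster on a
# best-effort basis, so the migration target can be populated and validated
# without risking production traffic. The writer callables stand in for
# version-specific Elasticsearch clients.
import logging
from typing import Callable

Doc = dict
Writer = Callable[[str, Doc], None]

def make_dual_writer(primary_write: Writer, shadow_write: Writer) -> Writer:
    def write(doc_id: str, doc: Doc) -> None:
        # Primary write: failures must surface to the caller as before.
        primary_write(doc_id, doc)
        # Shadow write: never let the new cluster disrupt live traffic.
        try:
            shadow_write(doc_id, doc)
        except Exception:
            logging.exception("shadow write to ES8 failed for id=%s", doc_id)
    return write
```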
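For the dbt and DuckDB item, the DuckDB capabilities the dbt-duckdb adapter builds on (a persisted database, external GeoJSON via the spatial extension, Parquet export, and a PostgreSQL attach) can be exercised directly from Python; file paths, table and column names, and connection details below are hypothetical.

```python
# Hedged sketch of the DuckDB features underlying a local dbt-duckdb workflow.
# Paths, table names, and the GeoJSON property names are hypothetical.
import duckdb

con = duckdb.connect("local_warehouse.duckdb")  # persisted, not in-memory

# Read an external GeoJSON file via the spatial extension.
con.install_extension("spatial")
con.load_extension("spatial")
con.sql("""
    CREATE OR REPLACE TABLE neighborhoods AS
    SELECT * FROM ST_Read('data/neighborhoods.geojson')
""")

# Export a transformed result to Parquet, similar in spirit to the
# adapter's external materialization.
con.sql("""
    COPY (SELECT name, ST_Area(geom) AS area FROM neighborhoods)
    TO 'output/neighborhood_area.parquet' (FORMAT parquet)
""")

# Attach a local PostgreSQL database and query it alongside DuckDB tables.
con.install_extension("postgres")
con.load_extension("postgres")
con.sql("ATTACH 'dbname=analytics host=localhost' AS pg (TYPE postgres)")
con.sql("""
    SELECT n.name, e.event_count
    FROM neighborhoods n
    JOIN pg.public.events e ON e.neighborhood = n.name
""").show()
```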
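For the durable execution podcast, the core idea (checkpoint each step's result so a restarted workflow resumes rather than repeating side effects) can be illustrated framework-free; this is a conceptual sketch, not the DBOS Transact API, and a real system would keep checkpoints in a transactional store such as Postgres.

```python
# Conceptual sketch of durable execution, not the DBOS Transact API: each
# step's output is checkpointed under a (workflow_id, step_name) key, so a
# re-run of the same workflow skips completed steps and reuses their recorded
# results instead of repeating side effects.
import json
from typing import Any, Callable

checkpoints: dict[tuple[str, str], str] = {}  # stand-in for a durable store

def durable_step(workflow_id: str, step_name: str, fn: Callable[[], Any]) -> Any:
    key = (workflow_id, step_name)
    if key in checkpoints:                  # already ran: reuse recorded result
        return json.loads(checkpoints[key])
    result = fn()                           # first run: execute the step
    checkpoints[key] = json.dumps(result)   # record before moving on
    return result

def order_workflow(workflow_id: str, order: dict) -> str:
    total = durable_step(workflow_id, "price", lambda: order["qty"] * 9.99)
    charge = durable_step(workflow_id, "charge", lambda: {"charged": total})
    return durable_step(workflow_id, "confirm", lambda: f"order confirmed: {charge}")

print(order_workflow("wf-123", {"qty": 3}))
print(order_workflow("wf-123", {"qty": 3}))  # re-run resumes from checkpoints
```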
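For FastAPI-MCP, the zero-configuration claim roughly corresponds to a pattern like the one below; the class and method names follow the project's documented quick-start but may differ by version, and the endpoint is a made-up example.

```python
# Hedged quick-start sketch for exposing existing FastAPI endpoints as MCP
# tools. Verify FastApiMCP and mount() against the repo's README; the route
# below is a made-up example.
from fastapi import FastAPI
from fastapi_mcp import FastApiMCP

app = FastAPI()

@app.get("/items/{item_id}", operation_id="get_item")
def get_item(item_id: int) -> dict:
    # Ordinary FastAPI endpoint; nothing MCP-specific is needed here.
    return {"item_id": item_id}

# Expose the app's endpoints as MCP tools with no extra configuration.
mcp = FastApiMCP(app)
mcp.mount()
```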