Reducing Runtime Errors in Spark: Why We Migrated from DataFrame to Dataset (8 minute read)
Migrating from DataFrame to Dataset in Apache Spark significantly reduces runtime errors by leveraging Scala's type safety to catch issues at compile time. DataFrames are untyped, so problems like column mismatches or unexpected nulls only surface during execution and disrupt pipelines. Datasets, with their strongly typed structure, catch these errors early, improving debugging and reliability in complex data workflows. The shift enabled Agoda's team to streamline development, boost pipeline stability, and accelerate the delivery of data-driven features.

An LLM-Based Workflow for Automated Tabular Data Validation (6 minute read)
Automated cleaning of tabular datasets can be structured around a framework focused on data validity: expected formats, types, and distributions. The process defines data expectations through semantic and statistical analysis and uses large language models (LLMs) to detect issues and suggest corrections. Effective validation hinges on iterative rule-based checks with libraries like Pandera and Great Expectations, while a human-in-the-loop approach improves transparency and reliability (a hedged Pandera sketch appears below). An implementation of the presented method is available.

Tinder's migration to Elasticsearch 8 (10 minute read)
Tinder's ES6-to-ES8 migration proceeded in three stages. Stage 1 established data consistency through dual writes to both the ES6 and ES8 clusters (a dual-write sketch appears below). Stage 2 covered reindexing data into ES8, rigorous validation checks, performance benchmarking, and testing of critical functionality such as search relevance and recommendations. Stage 3 deployed shadow traffic for real-time evaluation, gradually shifted live traffic while monitoring key metrics, and completed the cutover and decommissioning of the old cluster.

What "Shifting Left" Means and Why it Matters for Data Stacks (22 minute read)
Shifting left in data engineering means moving data quality checks, business logic, and governance processes earlier in the data lifecycle, closer to the data source. The goal is to detect and resolve data issues promptly, reducing downstream errors and producing more maintainable data systems. By adopting declarative tools like SQLMesh, organizations can centralize logic, streamline development workflows, and keep metric definitions consistent across the stack. The approach improves data quality and system performance while aligning data practices more closely with software engineering principles.

Fully Local Data Transformation with dbt and DuckDB (7 minute read)
Engineers can run data transformations entirely locally using DuckDB and dbt via the dbt-duckdb adapter. A persisted DuckDB database combined with external GeoJSON files processes 400 MB of data in 40-45 seconds on a MacBook Pro, and the adapter's external materialization feature supports exporting to CSV, JSON, and Parquet files with full-refresh capability. DuckDB also integrates smoothly with a local PostgreSQL instance, allowing PostgreSQL data to be queried alongside DuckDB's efficient in-process engine (a DuckDB sketch appears below).

Ecosystem Considerations for Data Science (14 minute read)
Organizations often rush to adopt AI as a quick fix for business challenges without integrating it into a comprehensive data ecosystem framework. Successful ML/AI implementations hinge not on standalone technology but on a holistic data strategy encompassing quality data, robust infrastructure, algorithms, and human expertise. Key considerations include aligning AI projects with business strategy, ensuring data readiness, and fostering cross-departmental collaboration. Prioritizing the full data ecosystem rather than isolated AI ventures reduces failure rates and increases business impact.

Simplifying Data Pipelines with Durable Execution (40 minute podcast)
Durable execution enables exactly-once processing in distributed systems without external components like queues or CI pipelines. In this podcast, Jeremy Edberg, CEO of DBOS, discusses how the company's Transact library provides local resilience and streamlined orchestration through a serverless architecture, and how version control for long-running workflows ensures updates don't disrupt in-flight executions. The approach is especially effective for complex pipelines and AI-driven systems that require reliability and maintainability (a conceptual checkpointing sketch appears below).

Powering Semantic SQL for AI Agents with Apache DataFusion (12 minute read)
Wren Engine is a semantic SQL engine powered by Apache DataFusion, designed to improve AI agents' interaction with enterprise data. By integrating a semantic layer through its Modeling Definition Language (MDL), Wren Engine lets agents understand complex data relationships and business logic, enabling accurate and efficient SQL generation across diverse databases. This addresses challenges in SQL dialect compatibility and schema understanding, allowing AI agents to deliver reliable, context-aware insights.

xorq: Multi-engine ML pipelines made simple (GitHub Repo)
Built on Ibis and DataFusion, xorq is a deferred computation framework that brings the replicability and efficiency of declarative pipelines to Python's machine learning ecosystem. It lets users write pandas-like transformations that are memory-efficient, automatically cache intermediate outputs, and switch between SQL engines and Python user-defined functions (UDFs) while ensuring consistent, reproducible results.

Ringing Out the Old: AI's Role in Redefining Data Teams, Tools, and Business Models (53 minute podcast)
AI is forcing a major reset in how data teams operate, moving from tool-centric to problem-centric thinking. Teams must prioritize adaptability and focus on how AI can directly affect business outcomes rather than just optimizing data stacks. AI will compress the complexity of data tooling, making many low-level engineering tasks obsolete. Ultimately, success will hinge on how well teams rethink their roles and workflows in light of AI's rapid integration.

A roadmap to scaling Postgres (7 minute read)
This tiered roadmap for data engineers breaks Postgres scaling into stages: first mastering MVCC internals and optimizing data models, then strategic indexing, partitioning, and caching, and only later sharding, hybrid storage engines, and derivative systems. It stresses cost-effective hardware or managed DBaaS choices early on and reserves complex architectures for last.

FastAPI-MCP (GitHub Repo)
A zero-configuration tool for automatically exposing FastAPI endpoints as Model Context Protocol (MCP) tools (a hedged quick-start sketch appears below).
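For the LLM-based tabular validation item, here is a minimal Pandera sketch of the kind of rule-based validity check the workflow relies on; the column names, types, and checks are illustrative assumptions, not the article's actual expectations.

```python
# Minimal Pandera sketch: declare expected types, formats, and ranges for a
# tabular dataset, then validate. Column names and checks are illustrative
# assumptions, not the article's actual rules.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, pa.Check.ge(1), nullable=False),
        "country": pa.Column(str, pa.Check.isin(["US", "GB", "DE"])),
        "signup_date": pa.Column(pa.DateTime, nullable=False),
        "spend": pa.Column(float, pa.Check.in_range(0, 1_000_000)),
    },
    strict=True,  # reject unexpected columns
)

df = pd.DataFrame(
    {
        "user_id": [1, 2],
        "country": ["US", "DE"],
        "signup_date": pd.to_datetime(["2024-01-05", "2024-02-11"]),
        "spend": [12.5, 80.0],
    }
)

# With lazy=True, all violations are collected and reported together
# instead of raising on the first failure.
validated = schema.validate(df, lazy=True)
print(validated.head())
```

Checks like these can be generated or refined iteratively (for example, from LLM-suggested expectations) and reviewed by a human before being enforced.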
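For Tinder's migration, Stage 1's dual-write idea can be sketched as a thin wrapper around two writer functions; the names and document shape are hypothetical stand-ins for the version-specific Elasticsearch clients, not Tinder's implementation.

```python
# Hedged sketch of a dual-write wrapper: every write goes to the primary
# (ES6) path as before and is mirrored to the new ES8 cluster on a
# best-effort basis, so the migration target can be populated and validated
# without risking production traffic. The writer callables stand in for
# version-specific Elasticsearch clients.
import logging
from typing import Callable

Doc = dict
Writer = Callable[[str, Doc], None]

def make_dual_writer(primary_write: Writer, shadow_write: Writer) -> Writer:
    def write(doc_id: str, doc: Doc) -> None:
        # Primary write: failures must surface to the caller as before.
        primary_write(doc_id, doc)
        # Shadow write: never let the new cluster disrupt live traffic.
        try:
            shadow_write(doc_id, doc)
        except Exception:
            logging.exception("shadow write to ES8 failed for id=%s", doc_id)
    return write
```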
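For the dbt and DuckDB item, the DuckDB capabilities the dbt-duckdb adapter builds on (a persisted database, external GeoJSON via the spatial extension, Parquet export, and a PostgreSQL attach) can be exercised directly from Python; file paths, table and column names, and connection details below are hypothetical.

```python
# Hedged sketch of the DuckDB features underlying a local dbt-duckdb workflow.
# Paths, table names, and the GeoJSON property names are hypothetical.
import duckdb

con = duckdb.connect("local_warehouse.duckdb")  # persisted, not in-memory

# Read an external GeoJSON file via the spatial extension.
con.install_extension("spatial")
con.load_extension("spatial")
con.sql("""
    CREATE OR REPLACE TABLE neighborhoods AS
    SELECT * FROM ST_Read('data/neighborhoods.geojson')
""")

# Export a transformed result to Parquet, similar in spirit to the
# adapter's external materialization.
con.sql("""
    COPY (SELECT name, ST_Area(geom) AS area FROM neighborhoods)
    TO 'output/neighborhood_area.parquet' (FORMAT parquet)
""")

# Attach a local PostgreSQL database and query it alongside DuckDB tables.
con.install_extension("postgres")
con.load_extension("postgres")
con.sql("ATTACH 'dbname=analytics host=localhost' AS pg (TYPE postgres)")
con.sql("""
    SELECT n.name, e.event_count
    FROM neighborhoods n
    JOIN pg.public.events e ON e.neighborhood = n.name
""").show()
```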
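For the durable execution podcast, the core idea (checkpoint each step's result so a restarted workflow resumes rather than repeating side effects) can be illustrated framework-free; this is a conceptual sketch, not the DBOS Transact API, and a real system would keep checkpoints in a transactional store such as Postgres.

```python
# Conceptual sketch of durable execution, not the DBOS Transact API: each
# step's output is checkpointed under a (workflow_id, step_name) key, so a
# re-run of the same workflow skips completed steps and reuses their recorded
# results instead of repeating side effects.
import json
from typing import Any, Callable

checkpoints: dict[tuple[str, str], str] = {}  # stand-in for a durable store

def durable_step(workflow_id: str, step_name: str, fn: Callable[[], Any]) -> Any:
    key = (workflow_id, step_name)
    if key in checkpoints:                  # already ran: reuse recorded result
        return json.loads(checkpoints[key])
    result = fn()                           # first run: execute the step
    checkpoints[key] = json.dumps(result)   # record before moving on
    return result

def order_workflow(workflow_id: str, order: dict) -> str:
    total = durable_step(workflow_id, "price", lambda: order["qty"] * 9.99)
    charge = durable_step(workflow_id, "charge", lambda: {"charged": total})
    return durable_step(workflow_id, "confirm", lambda: f"order confirmed: {charge}")

print(order_workflow("wf-123", {"qty": 3}))
print(order_workflow("wf-123", {"qty": 3}))  # re-run resumes from checkpoints
```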
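For FastAPI-MCP, the zero-configuration claim roughly corresponds to a pattern like the one below; the class and method names follow the project's documented quick-start but may differ by version, and the endpoint is a made-up example.

```python
# Hedged quick-start sketch for exposing existing FastAPI endpoints as MCP
# tools. Verify FastApiMCP and mount() against the repo's README; the route
# below is a made-up example.
from fastapi import FastAPI
from fastapi_mcp import FastApiMCP

app = FastAPI()

@app.get("/items/{item_id}", operation_id="get_item")
def get_item(item_id: int) -> dict:
    # Ordinary FastAPI endpoint; nothing MCP-specific is needed here.
    return {"item_id": item_id}

# Expose the app's endpoints as MCP tools with no extra configuration.
mcp = FastApiMCP(app)
mcp.mount()
```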