📱

Deep Dives

Uber's Strategy to Upgrading 2M+ Spark Jobs (10 minute read)

Uber upgraded from Spark 2.4 to Spark 3.3, migrating over 40,000 Spark applications and 2,100 applications in just six months. It automated the process using Polyglot Piranha, an open-source tool that parses and rewrites code by converting it into an Abstract Syntax Tree (AST) and applying transformation rules to enable bulk changes across applications.
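
To make the parse-and-rewrite idea concrete, here is a minimal sketch that uses Python's standard `ast` module rather than Polyglot Piranha itself. The rule (renaming calls to a hypothetical `old_session` function to `new_session`) and the helper `rewrite` are illustrative assumptions, not rules from Uber's migration.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Transformation rule: rewrite calls to one function name into another."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name

    def visit_Call(self, node: ast.Call) -> ast.AST:
        # Visit children first so nested calls are rewritten as well.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            node.func = ast.copy_location(
                ast.Name(id=self.new_name, ctx=ast.Load()), node.func
            )
        return node


def rewrite(source: str, old_name: str, new_name: str) -> str:
    """Parse source into an AST, apply the rule, and emit code again."""
    tree = ast.parse(source)
    tree = RenameCall(old_name, new_name).visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


# Hypothetical function names, used only for illustration.
print(rewrite("x = old_session(old_session(1))", "old_session", "new_session"))
```

Because the rule matches call nodes rather than raw text, a variable that merely shares the name `old_session` is left untouched, which is the main advantage of AST-based rewriting over search-and-replace.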

How Kafka Works (20 minute read)

Apache Kafka is an open-source distributed messaging system that enables high-throughput processing by storing key-value records in immutable, offset-ordered logs within topics sharded into partitions. Production clusters typically run at least three brokers for fault-tolerant replication (commonly with a replication factor of three): leaders handle writes, followers replicate data, and the KRaft protocol replaces ZooKeeper for streamlined coordination and recovery.
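
The log model described above can be sketched as a toy in-memory structure. This is not Kafka's API: the `Topic` and `Partition` classes are hypothetical, and CRC32 stands in for the key hash purely as an assumption (Kafka clients use their own partitioner).

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Append-only log; a record's offset is its position in the list."""

    records: list = field(default_factory=list)

    def append(self, key: bytes, value: bytes) -> int:
        self.records.append((key, value))
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset: int, max_records: int = 10) -> list:
        # Consumers read forward from an offset; records are never modified.
        return self.records[offset:offset + max_records]


class Topic:
    """A topic sharded into a fixed number of partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [Partition() for _ in range(num_partitions)]

    def partition_for(self, key: bytes) -> int:
        # Assumption: CRC32 is used here only to keep the sketch stdlib-only.
        return zlib.crc32(key) % len(self.partitions)

    def produce(self, key: bytes, value: bytes) -> tuple[int, int]:
        p = self.partition_for(key)
        return p, self.partitions[p].append(key, value)


topic = Topic(num_partitions=3)
p1, o1 = topic.produce(b"user-1", b"signed_up")
p2, o2 = topic.produce(b"user-1", b"logged_in")
print(p1 == p2, o1, o2)
```

Records with the same key land in the same partition, so their relative order is preserved by increasing offsets; ordering across different partitions is not guaranteed in this model.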

Switching Me Softly (7 minute read)

Fresha upgraded 20+ critical production PostgreSQL databases from PG12 to PG17 with no downtime or data loss by building a configurable, automated orchestration framework that handles Debezium CDC, outbox event ordering, replication slots, and PgBouncer cutovers. YAML-driven scripts made the migrations repeatable, reversible, and tailored per database, overcoming limitations of RDS Blue/Green and in-place upgrades.

How We Scaled Raw GROUP BY to 100 B+ Rows In Under A Second (30 minute read)

ClickHouse introduced Parallel Replicas to enable infinite horizontal scaling for GROUP BY queries on massive datasets, such as aggregating 100 billion rows in under a second. This addresses the growing demands of analytical workloads like observability and AI analytics, where GROUP BY is prevalent in over half of BI queries.
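
The general idea of spreading a GROUP BY across several workers can be sketched as partial aggregation followed by a merge. This is a toy illustration, not ClickHouse's implementation: the function names are hypothetical, and the round-robin split of rows across "replicas" is an assumption made for simplicity.

```python
def split(rows: list, n: int) -> list:
    """Assumed round-robin assignment of rows to n replicas."""
    return [rows[i::n] for i in range(n)]


def partial_group_by(rows: list) -> dict:
    """Each replica computes SUM(value) GROUP BY key over its own slice."""
    sums: dict = {}
    for key, value in rows:
        sums[key] = sums.get(key, 0) + value
    return sums


def merge(partials) -> dict:
    """The coordinator combines the partial sums into the final result."""
    total: dict = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = merge(partial_group_by(chunk) for chunk in split(rows, 3))
print(result)
```

This works because SUM is mergeable: partial results can be combined in any order. Aggregates such as exact medians need richer intermediate states than a single number per key.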

🚀

Opinions & Advice

Apache Parquet vs. Newer File Formats (BtrBlocks, FastLanes, Lance, Vortex) (7 minute read)

Apache Parquet has been the dominant columnar file format for over a decade, excelling in large-scale analytical workloads with features like columnar layout, compression, and broad ecosystem support from tools like Spark and Iceberg. However, newer formats like BtrBlocks, FastLanes, Lance, Vortex, and Nimble address modern needs such as AI pipelines, GPU acceleration, and low-latency access on hardware such as NVMe storage and SIMD-enabled processors.

A SQL Heuristic: ORs Are Expensive (10 minute read)

OR conditions in SQL can be very costly because query planners often fall back to sequential scans or expensive index merges, while AND clauses fit compound indexes naturally. Rewriting ORs as UNIONs or restructuring the schema (e.g., with extension tables) can cut query times by as much as 100x and make access patterns clearer.
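
A minimal sketch of the OR-to-UNION rewrite, using Python's built-in `sqlite3` module. The `users` table, its columns, and the sample rows are hypothetical. The example only shows that the two queries return the same rows; whether the UNION form is faster depends on the database and its planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, phone TEXT);
    CREATE INDEX idx_users_email ON users (email);
    CREATE INDEX idx_users_phone ON users (phone);
    INSERT INTO users (email, phone) VALUES
        ('a@example.com', '111'),
        ('b@example.com', '222'),
        ('c@example.com', '333');
""")

# Original form: one query with an OR across two differently indexed columns.
OR_QUERY = "SELECT id FROM users WHERE email = ? OR phone = ?"

# Rewritten form: each branch can use its own single-column index.
# UNION (not UNION ALL) removes rows that match both branches.
UNION_QUERY = """
    SELECT id FROM users WHERE email = ?
    UNION
    SELECT id FROM users WHERE phone = ?
"""

params = ("a@example.com", "222")
or_ids = sorted(row[0] for row in conn.execute(OR_QUERY, params))
union_ids = sorted(row[0] for row in conn.execute(UNION_QUERY, params))
print(or_ids, union_ids)
```

Selecting the primary key keeps the two forms equivalent; if the select list could contain duplicate rows that the OR query would return twice, plain UNION would collapse them.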

💻

Launches & Tools

Apache Gravitino (GitHub Repo)

With 1.0 now released, Apache Gravitino is positioned as an open-source alternative to Unity Catalog. Rather than replacing Unity Catalog or Snowflake's governance, it complements them by acting as a layer above multiple systems, working across Hive, Iceberg, Kafka, S3, and ML model registries. It provides out-of-the-box connectors for multiple platforms and MCP servers.

Apache DataFusion 50.0.0 Released (6 minute read)

DataFusion 50.0.0 introduces significant performance enhancements: dynamic filter pushdown for inner hash joins, yielding order-of-magnitude improvements in scan efficiency; a rewritten nested loop join operator with up to 5x faster execution and 99% lower memory usage; and automatic Parquet metadata caching that delivers 12x faster point queries. New features include more robust disk-spilling sorts, QUALIFY and FILTER clauses for advanced analytics, and expanded Apache Spark compatibility.

ChartDB (Tool)

ChartDB helps engineers instantly turn database schemas into ER diagrams with AI-assisted editing, real-time collaboration, and auto-sync to live databases. It supports major DBs (Postgres, MySQL, SQL Server, Oracle, etc.), generates clean DDL, and produces shareable, versioned documentation.

🎁

Miscellaneous

The Model Selection Showdown: 6 Considerations for Choosing the Best Model (5 minute read)

Choosing the best machine learning model hinges on six essential considerations: clearly defining goals, establishing a baseline, selecting the right metrics, applying cross-validation, balancing complexity and interpretability, and validating with real-world data. The key is to prioritize fit with the problem, dataset, and stakeholder needs over algorithmic sophistication.
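
Two of these considerations, establishing a baseline and applying cross-validation, can be sketched in plain Python. Everything here is an illustrative assumption: the toy dataset, the helper names, and the choice of a majority-class baseline and a 1-nearest-neighbor model. In practice a library such as scikit-learn would usually handle this.

```python
from collections import Counter
from statistics import mean


def k_fold_indices(n: int, k: int):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def majority_baseline(train_X, train_y):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def one_nearest_neighbor(train_X, train_y):
    """Candidate model: predict the label of the closest training point."""
    def predict(x):
        nearest = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
        return train_y[nearest]
    return predict


def cross_val_accuracy(fit, X, y, k=5):
    """Mean accuracy over k folds; fit(train_X, train_y) returns a predictor."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(model(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return mean(scores)


# Toy data: one numeric feature, label is 1 when the feature is at least 5.
X = [1, 8, 2, 9, 3, 7, 0, 6, 4, 5]
y = [int(x >= 5) for x in X]
print(cross_val_accuracy(majority_baseline, X, y))
print(cross_val_accuracy(one_nearest_neighbor, X, y))
```

Scoring the baseline with the same folds and metric as the candidate model makes the comparison fair: a model is only worth its added complexity if it clearly beats the baseline on held-out data.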

Perplexity Launches Search API to Power Next-Gen AI Applications (2 minute read)

Perplexity has launched the Search API, providing real-time, pre-ranked web snippets from an index spanning hundreds of billions of webpages, updating tens of thousands of documents per second. Tailored for AI-driven agents and retrieval-augmented pipelines, it streamlines grounding LLMs and accelerates integration by minimizing the need for preprocessing. Initial open-source benchmarks show superior quality and latency compared to alternatives, with developer tooling enabling rapid prototyping and lower operating costs.

⚡

Quick Links

The Great Consolidation is underway (2 minute read)

The "Great Consolidation" in data engineering is accelerating, as mergers such as Fivetran's point to a maturing but overhyped market.

AWS Glue Iceberg Rest Catalog (5 minute read)

How to set up a local Iceberg + Spark environment using the Iceberg REST catalog to mimic a Glue 5.0 environment, enabling ETL testing without EMR spend.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Migrating Millions of Spark Jobs ⚡️, How Kafka Works 📨, Unity Catalog Alternative 🌐

TLDR Data 2025-10-02
