Uber's Strategy to Upgrading 2M+ Spark Jobs (10 minute read)
Uber upgraded from Spark 2.4 to Spark 3.3, migrating over 40,000 Spark applications and 2,100 applications in just six months. It automated the process using Polyglot Piranha, an open-source tool that parses code into an Abstract Syntax Tree (AST) and applies transformation rules to rewrite it, enabling bulk changes across applications.

How Kafka Works (20 minute read)
Apache Kafka is an open-source distributed messaging system that enables high-throughput processing by storing key-value records in immutable, offset-ordered logs within topics sharded into partitions. Its architecture relies on clusters of at least three brokers for fault-tolerant replication (typically with a replication factor of three), where leaders handle writes and followers replicate data. The KRaft protocol replaces ZooKeeper for streamlined coordination and recovery.

Switching Me Softly (7 minute read)
Fresha upgraded 20+ critical production PostgreSQL databases (PG12 to PG17) with zero downtime and no data loss by developing a configurable, automated orchestration framework that handles Debezium CDC, outbox event ordering, replication slots, and PgBouncer cutovers. YAML-driven scripts enabled repeatable, reversible migrations tailored per database, overcoming the limitations of RDS Blue/Green and in-place upgrades.

How We Scaled Raw GROUP BY to 100 B+ Rows In Under A Second (30 minute read)
ClickHouse introduced Parallel Replicas to enable infinite horizontal scaling for GROUP BY queries on massive datasets, such as aggregating 100 billion rows in under a second. This addresses the growing demands of analytical workloads like observability and AI analytics, where GROUP BY is prevalent in over half of BI queries.

Apache Parquet vs. Newer File Formats (BtrBlocks, FastLanes, Lance, Vortex) (7 minute read)
Apache Parquet has dominated as a columnar file format for over a decade, excelling in large-scale analytical workloads with features like columnar layout, compression, and broad ecosystem support from tools like Spark and Iceberg. However, newer formats like BtrBlocks, FastLanes, Lance, Vortex, and Nimble address modern needs such as AI pipelines, GPU acceleration, and low-latency access on hardware like NVMe drives and SIMD-enabled processors.

A SQL Heuristic: ORs Are Expensive (10 minute read)
OR conditions in SQL can be very costly because query planners often fall back to sequential scans or expensive index merges, while AND clauses fit compound indexes naturally. Rewriting ORs as unions or restructuring the schema (e.g., with extension tables) can cut query times by 100x and make access patterns clearer.

Apache Gravitino (GitHub Repo)
With 1.0 now released, Apache Gravitino is an open-source alternative to Unity Catalog. It doesn't replace Unity Catalog or Snowflake's governance; instead, it complements them by acting as a layer above multiple systems, working across Hive, Iceberg, Kafka, S3, and ML model registries. It provides out-of-the-box connectors for multiple platforms and MCP servers.

Apache DataFusion 50.0.0 Released (6 minute read)
DataFusion 50.0.0 introduces significant performance enhancements: dynamic filter pushdown for inner hash joins, yielding order-of-magnitude improvements in scan efficiency; a rewritten nested loop join operator with up to 5x faster execution and 99% lower memory usage; and automatic Parquet metadata caching, delivering 12x faster point queries. Key new features include robust disk-spilling sorts, QUALIFY and FILTER clauses for advanced analytics, and expanded Apache Spark compatibility.

ChartDB (Tool)
ChartDB helps engineers instantly turn database schemas into ER diagrams with AI-assisted editing, real-time collaboration, and auto-sync to live databases. It supports major databases (Postgres, MySQL, SQL Server, Oracle, etc.), generates clean DDL, and produces shareable, versioned documentation.

The Model Selection Showdown: 6 Considerations for Choosing the Best Model (5 minute read)
Choosing the best machine learning model hinges on six essential considerations: clearly defining goals, establishing a baseline, selecting the right metrics, applying cross-validation, balancing complexity and interpretability, and validating with real-world data. The key is to align the choice with the problem, dataset, and stakeholder needs rather than reaching for advanced algorithms.

Perplexity Launches Search API to Power Next-Gen AI Applications (2 minute read)
Perplexity has launched the Search API, providing real-time, pre-ranked web snippets from an index that spans hundreds of billions of webpages and updates tens of thousands of documents per second. Tailored for AI-driven agents and retrieval-augmented pipelines, it streamlines grounding LLMs and accelerates integration by minimizing the need for preprocessing. Initial open-source benchmarks show superior quality and latency compared to alternatives, with developer tooling enabling rapid prototyping and lower operating costs.