Slashing Snowflake Costs ❄️, Open-Source Agent Tradeoffs 🤖, Kafka’s New Bottleneck ⚙️

TLDR

TLDR Data 2026-05-28

📱

Deep Dives

Kafka Share Groups and Parallelizing Consumption — Tuning max.poll.records (14 minute read)

With Kafka Share Groups, the main bottleneck shifts from partition count to the interplay of max.record.locks and max.poll.records. The max.poll.records default of 500 is often too high and causes “greedy capture,” where a few consumers hog large batches. The recommended starting point is roughly max.record.locks / consumers-per-partition, tuned slightly lower, for stable, high throughput.
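
The sizing rule above can be sketched as a small calculation. This is an illustration rather than anything from the article: the function name `suggest_max_poll_records` and the `headroom_pct` parameter are hypothetical, and the lock limit and consumer count in the usage note are made-up inputs.

```python
def suggest_max_poll_records(
    max_record_locks: int,
    consumers_per_partition: int,
    headroom_pct: int = 90,
) -> int:
    """Rough starting value for max.poll.records under the rule of thumb
    max.record.locks / consumers-per-partition, tuned slightly lower.

    headroom_pct is a hypothetical knob for the "tune slightly lower" step;
    90 means "take 90% of the fair share".
    """
    if max_record_locks < 1 or consumers_per_partition < 1:
        raise ValueError("both inputs must be positive integers")
    if not 1 <= headroom_pct <= 100:
        raise ValueError("headroom_pct must be between 1 and 100")

    # Fair share of the per-partition lock budget for each consumer.
    fair_share = max_record_locks // consumers_per_partition

    # Back off a little so no single poll can grab the whole share,
    # but never go below one record per poll.
    return max(1, fair_share * headroom_pct // 100)
```

With an assumed lock budget of 2000 and 8 consumers per partition, `suggest_max_poll_records(2000, 8)` gives 225, well under the default of 500. Treat the result as a first guess to validate against measured throughput.
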
How CockroachDB Built Vector Indexing at Scale (8 minute read)

CockroachDB built its own vector indexing system called C-SPANN to support scalable vector search because existing approaches like HNSW and IVF didn't fit its distributed architecture. C-SPANN uses a hierarchical K-means tree stored as regular table data, supports real-time inserts and deletes, and integrates natively with CockroachDB's sharding and rebalancing.
Design S3 Object Storage Like a Senior Engineer (31 minute read)

S3-scale object storage hinges on a flat, immutable namespace: buckets hold objects identified by keys, and metadata is kept separate from payload bytes so that each can scale independently. At ~100PB and hundreds of millions of objects, the design requires distributed metadata sharding, merged on-disk segment files to avoid inode exhaustion, and chunking of large objects for parallel reads and range requests.
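
To make the chunking point concrete, here is a minimal sketch of how a byte-range request might be mapped onto fixed-size chunks that can then be fetched in parallel. It assumes fixed-size chunks and an inclusive end offset (as in HTTP Range headers); `ChunkRead` and `plan_range_read` are hypothetical names, not part of the article's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRead:
    index: int   # which chunk of the object to read
    offset: int  # byte offset inside that chunk
    length: int  # number of bytes to read from that chunk


def plan_range_read(
    start: int, end: int, object_size: int, chunk_size: int
) -> list[ChunkRead]:
    """Split the inclusive byte range [start, end] into per-chunk reads."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= start <= end < object_size:
        raise ValueError("range must lie inside the object")

    reads = []
    for index in range(start // chunk_size, end // chunk_size + 1):
        chunk_start = index * chunk_size
        # The last chunk of an object may be shorter than chunk_size.
        chunk_end = min(chunk_start + chunk_size, object_size) - 1
        lo = max(start, chunk_start)
        hi = min(end, chunk_end)
        reads.append(ChunkRead(index, lo - chunk_start, hi - lo + 1))
    return reads
```

Each `ChunkRead` is independent of the others, so a client could issue them concurrently and concatenate the results in index order.
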
🚀

Opinions & Advice

I Inherited a $140K Snowflake Bill — Three Months Later It Was $38K. Here's Everything I Learned (23 minute read)

Snowflake cost and performance hinge on three separable layers: storage, compute, and cloud services, with the biggest savings coming from right-sizing warehouses, aggressive auto-suspend, and reducing storage bloat from retention settings. The strongest optimization levers are physical data layout and query design: use clustering only when query predicates match the clustering keys; avoid SELECT *, function-wrapped filters, and full reloads; and prefer incremental pipelines and pre-aggregation before joins.
I battletested 5 open source analytics agents (14 minute read)

Open-source “analytics agents” are often grouped together, but LangChain, Wren AI, nao, LibreChat, and Vercel's template solve very different problems, and only some are actually built for analytics. Reliable answers depend less on the agent interface and more on where business context lives, whether that's prompts, semantic models, markdown files, or the underlying MCP/tooling layer.
AI Risk Is an Architecture Problem (20 minute read)

AI risk should be assessed at the system level, not just the model level. The three mechanism risks of data exposure, incorrect output, and unintended action map to five business harms: brand, compliance, liability, operational, and commercial risk. The most important control is architecture: what the AI can see, what its output feeds into, and what it can do without checks. Adding human review, deterministic validations, and bounded permissions can sharply reduce action risk without changing the model.
💻

Launches & Tools

2026 State of Analytics Engineering Report by dbt Labs (Sponsor)

AI is speeding up analytics work, but the fundamentals still decide whether anyone trusts the output. dbt Labs' 2026 State of Analytics Engineering Report looks at AI-assisted coding, governance gaps, infrastructure costs, and the growing pressure to deliver reliable insights faster.
RushDB 2.0: Memory Infrastructure for the Agentic Era (11 minute read)

RushDB 2.0 is agent memory infrastructure that combines graph storage, semantic search, ontology/schema discovery, MCP access, skills, analytics queries, and BYO Neo4j in a single layer. Agents need structured memory and reliable context, not a separate vector store, graph DB, and schema-discovery workflow stitched together manually.
MurrDB (GitHub Repo)

MurrDB is a fast NVMe/S3-backed serving cache for ML/AI inference built for batch reads/writes over large tabular data without keeping everything in RAM. It is a cheaper, lower-latency alternative to Redis for feature and document-attribute retrieval, not a general-purpose database.
Auditing Model Bias with Balanced Datasets with Mimesis (7 minute read)

The Mimesis library can create synthetic, balanced counterfactual datasets to test whether a model contains hidden bias tied to attributes such as gender, age, or ethnicity, while holding other features constant. This helps teams measure prediction changes and detect unwanted bias in a safe, privacy-preserving way.
Scaling AI-Driven Marketing Processes with PostgreSQL (6 minute read)

Marketing teams can scale AI workflows reliably by using PostgreSQL as their central data layer: managing workflow state with ENUMs, combining relational tables with JSONB for flexibility, linking campaign, asset, and performance data, and using full-text search and pgvector for semantic context.
🎁

Miscellaneous

Open Data Product SDK: Turning Data Product Ideas Into Standard YAML With AI Models (5 minute read)

Open Data Product SDK now supports AI-assisted conversion of free-form text and Markdown into standards-ready YAML for data product catalogs, item-level specs, and ODPG graph context. The workflow captures product descriptions, use cases, business objectives, and signals, then generates ODPC Catalog YAML and connected portfolio metadata. The goal is to replace manual metadata editing with a standards-first path from stakeholder language to machine-readable data product definitions.
Deconstructing Data Sketches (8 minute read)

Data sketches estimate expensive metrics like distinct counts by storing a small probabilistic sample, such as the lowest K hashed values, instead of scanning every row. They trade perfect accuracy for huge speed and compute savings, making them useful for large-scale dashboards, reports, and distributed aggregation.
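
The "lowest K hashed values" idea can be sketched in a few lines. This is a simplified K-minimum-values (KMV) estimator written for illustration, not the article's implementation; `kmv_estimate` and `_hash01` are hypothetical names, and SHA-256 stands in for whatever hash a production sketch library would use.

```python
import hashlib
import heapq
from typing import Iterable


def _hash01(value: str) -> float:
    """Map a value to a pseudo-uniform number in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values: Iterable[str], k: int = 256) -> float:
    """Estimate the number of distinct values while keeping only the
    k smallest hashes in memory."""
    heap: list[float] = []   # negated hashes, so heap[0] is the largest kept
    kept: set[float] = set()  # the same hashes, for duplicate checks

    for value in values:
        h = _hash01(value)
        if h in kept:
            continue  # repeated value (or one already tracked)
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)

    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: exact count

    # If n distinct hashes are spread uniformly over [0, 1), the k-th
    # smallest sits near k / n, which gives the estimate (k - 1) / kth.
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest
```

Memory stays bounded by `k` regardless of input size, and larger `k` tightens the estimate at the cost of a bigger sketch. Because the kept hashes depend only on the set of distinct values, sketches built on separate partitions can be combined by taking the k smallest hashes of their union, which is what makes the approach suit distributed aggregation.
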

Quick Links

Visualize the Brrr (Website)

GPUs are the hidden engines driving today's AI revolution, but most developers treat them as mysterious, costly accelerators.
Announcing Polars 1.41 (2 minute read)

Polars 1.41 delivers three practical gains for analytical workloads: faster Parquet footer decoding for wide tables, deeper common subplan elimination across nested query branches, and new LazyFrame.gather() support for integer-based row selection without materializing data.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud

