Improving Pinterest Search Relevance Using Large Language Models (7 minute read)
Pinterest improved search relevance by using a cross-encoder LLM as a teacher model to predict multi-class relevance, then distilling its knowledge into a lightweight student model that serves in real time. The approach combined enriched text features drawn from diverse Pin metadata with knowledge distillation and semi-supervised learning, leading to measurable improvements in search feed relevance and fulfillment rates.

Building Deep Research Agent from scratch (15 minute read)
This guide details how to build a Deep Research Agent from scratch using LLMs, web search integration, and iterative reflection steps. It shows how to design the system's state with Python dataclasses, plan report outlines, enrich content via automated searches, and format the final report in Markdown, offering a practical blueprint for advanced data processing and agent orchestration.

Towards Composable Data Infrastructure (10 minute read)
Building a composable data infrastructure requires balancing vendor-optimized performance with ecosystem openness. While open standards aim to prevent vendor lock-in, practical implementations often introduce hidden dependencies that can hinder portability and scalability. Catalog federation can address this issue: a centralized write catalog ensures consistent metadata management, while specialized read-only catalogs cater to specific query engines. This approach improves interoperability, allowing organizations to maintain flexibility and performance across diverse systems without compromising on governance or efficiency.

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data (10 minute read)
Discord re-engineered its dbt implementation to handle petabyte-scale data. Custom solutions such as developer-specific table aliasing, time-based incremental processing, and automated backfill versioning significantly reduced compile times and eliminated workflow clashes among 100+ developers working on over 2,500 models. These changes boosted performance by 5x and also streamlined testing, collaboration, and cost efficiency across its data stack.

Data Quality Evaluation: A 6-Step Framework Anyone Can Use (6 minute read)
A key strategy for preventing data quality issues is to monitor the five pillars of data health (freshness, volume, distribution, schema, and lineage) end-to-end so that problems are caught early. Even with robust testing, many issues go undetected and surface later, causing disruptions, which underscores the need for automated detection and tracking. Effective data quality management requires a combination of technology, such as data observability platforms, and stakeholder collaboration to set clear expectations and resolve incidents promptly.

Why Your Data Pipeline Probably Isn't Production-Ready (8 minute read)
Building a production-ready data pipeline involves more than moving data from Point A to Point B. It requires designing for backfills, avoiding duplicate data, and ensuring consistency. Strategies such as deleting data by date before reloading, using merge statements, and modularizing the workflow into logical steps help maintain pipeline integrity and make debugging easier. Quality checks, robust logging, and clear alerts improve reliability and scalability. Pipelines that integrate these elements are more robust over the long term and carry a lower maintenance burden.
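
The delete-by-date strategy mentioned in the pipeline item above can be illustrated with a minimal sketch. It is not taken from the linked article: the `daily_events` table, its columns, and the `load_partition` helper are hypothetical names chosen for illustration, and SQLite stands in for whatever warehouse a real pipeline would target. The idea is that each run removes the rows for the date it is about to load, in the same transaction as the insert, so a retry or backfill does not create duplicates.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace all rows for `day` so that re-running the load is idempotent.

    `rows` is a list of (user_id, amount) tuples; table and column names
    are illustrative assumptions, not from the article.
    """
    with conn:  # single transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM daily_events WHERE event_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_events (event_date, user_id, amount) "
            "VALUES (?, ?, ?)",
            [(day, user_id, amount) for user_id, amount in rows],
        )


def row_count(conn, day):
    """Count the rows currently stored for one date partition."""
    cursor = conn.execute(
        "SELECT COUNT(*) FROM daily_events WHERE event_date = ?", (day,)
    )
    return cursor.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_events (event_date TEXT, user_id TEXT, amount REAL)"
)

rows = [("u1", 10.0), ("u2", 5.5)]
load_partition(conn, "2025-04-01", rows)
load_partition(conn, "2025-04-01", rows)  # simulated retry or backfill

print(row_count(conn, "2025-04-01"))  # prints 2, not 4
```

On engines that support it, a merge (upsert) statement keyed on a unique identifier is an alternative way to get the same rerun safety without deleting first.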

Announcing the Agent2Agent Protocol (10 minute read)
Google Cloud's Agent2Agent (A2A) Protocol is an open standard that enables autonomous AI agents to securely interact across diverse enterprise systems and applications. It leverages familiar protocols like HTTP and JSON-RPC to facilitate seamless capability discovery, task management, and real-time collaboration among agents built using various frameworks. The protocol is a collaborative effort with over 50 industry partners and is set to revolutionize multi-agent workflows in areas such as customer service, supply chain planning, and candidate sourcing.

Cocoindex (GitHub Repo)
CocoIndex is an open-source engine designed for extracting, transforming, and indexing data with real-time incremental updates and custom logic for AI-ready pipelines. It offers a Python API for defining indexing flows that efficiently integrate with databases and supports embedding for semantic search applications.

Book Review: How to Make Money with Data (6 minute read)
Barbara Wixom's book "Data is Everyone's Business" underscores that all data management activities should either reduce costs or increase earnings to monetize data effectively. It suggests that organizations must eliminate slack resources to realize these benefits, employing strategies like improving, wrapping, and selling data. Wixom notes that siloed initiatives can hinder cohesive data strategies, urging leaders to develop enterprise-wide capabilities for data monetization. The book provides a framework for assessing data monetization opportunities and emphasizes the integration of technical and business capabilities to maximize data investments, offering practical insights for data leaders to drive tangible value.

Finding Bias in Data and ML models (4 minute read)
Bias in machine learning models can stem from various sources, including biased training data, algorithm design, or human input during development, often leading to unfair outcomes for certain groups. Detecting bias requires evaluating model performance across different demographics to identify disparities and using tools like fairness metrics or the What-If Tool to test scenarios and uncover inconsistencies. Mitigation strategies include improving data diversity, retraining models with balanced datasets, and implementing fairness constraints during training to reduce systematic errors. Continuous monitoring and auditing of models post-deployment are essential to ensure they remain unbiased and effective over time.
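
As a rough illustration of "evaluating model performance across different demographics" from the bias item above, the sketch below computes per-group accuracy and positive-prediction rate for a binary classifier, plus the gap in positive rates between groups (often called the demographic parity difference). The function names and the toy data are assumptions made for this example and do not come from the linked article or from any particular fairness library.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Return {group: {"accuracy": ..., "positive_rate": ...}} for binary labels."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "positive": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["correct"] += int(truth == pred)   # prediction matches the label
        s["positive"] += int(pred == 1)      # model predicted the positive class
    return {
        g: {
            "accuracy": s["correct"] / s["n"],
            "positive_rate": s["positive"] / s["n"],
        }
        for g, s in stats.items()
    }


def demographic_parity_gap(metrics):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [m["positive_rate"] for m in metrics.values()]
    return max(rates) - min(rates)


# Toy data (hypothetical): four examples each for groups "a" and "b".
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

metrics = group_metrics(y_true, y_pred, groups)
for name, m in sorted(metrics.items()):
    print(name, m)
print("parity gap:", demographic_parity_gap(metrics))
```

A large gap in either metric is a prompt for closer inspection, not proof of unfairness on its own; which metric matters depends on the application.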