ChatGPT Adds Tone Personalization (1 minute read)
OpenAI has introduced new personalization options in ChatGPT, letting users adjust enthusiasm, warmth, and emoji use directly. These controls, available in the Personalization menu, offer "More," "Less," or "Default" settings, expanding tone customization beyond the existing base style and tone feature.

Cursor Acquires Graphite (2 minute read)
Cursor has acquired Graphite, the code review platform known for its stacked pull request workflow. This marks Cursor's third acquisition as it aims to build a comprehensive AI-powered developer platform.

Introducing Bloom: an open source tool for automated behavioral evaluations (7 minute read)
Anthropic's Bloom is an open-source tool for generating automated behavioral evaluations of AI models. Bloom assesses specific behaviors, such as self-preferential bias and sabotage, by creating scenarios and quantifying how often the behavior occurs across models. It efficiently differentiates between aligned and misaligned models and correlates strongly with human judgment, enabling scalable and reliable behavior evaluations.

The changing drivers of LLM adoption (15 minute read)
LLM use is rising. People are using a wider range of LLMs, through more products, and in more places. ChatGPT remains dominant and keeps acquiring new users, but Gemini has grown faster over the last few months. OpenAI's revenue appears on track, though consumer revenue is likely shrinking as a share of the total. A substantial share of workplace AI use involves workers adopting tools on their own rather than waiting for employer-provided access.

Evaluating Context Compression for AI Agents (10 minute read)
What happens when agents run out of context determines whether they continue productively or have to start from scratch. This post presents an evaluation framework that measures how much context different compression strategies preserve.
Structured summarization retains more useful information than alternative methods without sacrificing compression efficiency.

Understanding AI Benchmarks (25 minute read)
Benchmarks are the most widely misunderstood part of the AI ecosystem. The prevailing narrative implies a universal increase in intelligence, but the numbers can be misleading. To navigate the noise, look at aggregate results, compare models relative to one another, and verify with your own tasks. The only benchmark that matters in the end is your own workload.

Experiment Diary (3 minute read)
This diary documents an experiment in teaching an LLM, via GRPO, to generate regular expressions from natural-language descriptions. It details the performance, learnings, modifications, and key takeaways from each run. The initial training run, on December 17, showed the model quickly learning to produce valid regex tags while still generating essentially random regex strings.

Qwen-Image-Layered (GitHub Repo)
Qwen-Image-Layered is a model that decomposes an image into multiple RGBA layers. Each layer can be independently resized, repositioned, and recolored without affecting other content. The approach enables high-fidelity and consistent editing.

Introducing MiMo-V2-Flash (10 minute read)
MiMo-V2-Flash is a powerful, efficient, and ultra-fast foundational language model that excels in reasoning, coding, and agentic scenarios. It serves as an excellent general-purpose assistant for everyday tasks. The model is available globally on Hugging Face, AI Studio, and Xiaomi's API platform. Benchmark results are available in the article.

jax-js (GitHub Repo)
jax-js is a machine learning framework for the browser. It brings JAX-style, high-performance CPU and GPU kernels to JavaScript, so users can run numerical applications on the web. The library is written from scratch, has no external dependencies, and can run anywhere a browser can run.
Multiplexing MCP Servers For Agentic Specialization (8 minute read)
MCP servers give agents the tools they need to accomplish tasks. This post discusses how to multiplex MCP servers behind a single gateway, simplifying the connection to the tools within them. Multiplexing lets multiple MCP servers be used over one gateway in a single interaction, so agents can reach servers spanning different stacks, clouds, applications, and frameworks for specialized tasks.

tcgen05 for dummies (70 minute read)
tcgen05 is the set of PTX instructions that program Tensor Cores on NVIDIA's latest Blackwell GPUs. This post is a tutorial for Blackwell in plain CUDA C++ with PTX, documenting the author's process of learning tcgen05 and reaching 98% of cuBLAS speed. Readers can follow along using Modal or any other B200 cloud provider.

How to game the METR plot (9 minute read)
METR's tasks are public, making it easy for a frontier lab to game horizon-length measurements. Under METR's assumptions, horizon length may add little information beyond benchmark accuracy. A meme has been circulating based on a team achieving a one-to-four-hour range on the METR plot; this post explains why that plot has been interpreted incorrectly.
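The multiplexing idea in the MCP item above can be sketched as a minimal gateway that namespaces tools from several servers behind one call interface. All names here are hypothetical, and each backend is modeled as a plain dict of callables; a real gateway would hold MCP client sessions and speak the MCP protocol:

```python
class McpGateway:
    """Route namespaced tool calls to one of several backend servers."""

    def __init__(self):
        self._servers: dict[str, dict] = {}

    def register(self, name: str, tools: dict) -> None:
        """Attach a backend server under a namespace."""
        self._servers[name] = tools

    def list_tools(self) -> list[str]:
        # The agent sees one flat, namespaced tool list: "server.tool".
        return [f"{s}.{t}" for s, tools in self._servers.items() for t in tools]

    def call(self, qualified: str, *args, **kwargs):
        # Split "server.tool" and dispatch to the owning backend.
        server, tool = qualified.split(".", 1)
        return self._servers[server][tool](*args, **kwargs)

# One interaction can now touch tools from different stacks and clouds:
gw = McpGateway()
gw.register("github", {"open_pr": lambda title: f"PR: {title}"})
gw.register("aws", {"list_buckets": lambda: ["logs", "assets"]})
```

The agent only ever negotiates with the gateway, which is what makes specialization cheap: adding a new stack is one `register` call rather than a new client connection in every agent.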