Claude Mythos (3 minute read)
'Mythos' is the name for a new tier of Anthropic models that are larger and more intelligent than Opus. The models score dramatically higher than Claude Opus 4.6 on tests of software coding, academic reasoning, and cybersecurity. Mythos is a large, compute-intensive model that is very expensive to use and serve, and Anthropic is working on making it much more efficient before any general release.

Meta tests Avocado 9B, Avocado Mango Agent, and more (2 minute read)
Meta's Avocado model has been pushed back to at least May, as it still falls short of leading systems from competitors. The company appears to be running parallel experiments with multiple Avocado variants. The model can solve complex math problems that earlier Llama models could not, but other labs solved those problems months earlier. Meta's AI leadership has reportedly discussed temporarily licensing Google's Gemini technology, and some requests within Meta AI are already being routed through Gemini models.

Function Calling Harness: From 6.75% to 100% (32 minute read)
AutoBe is an open-source AI agent that takes a single natural language conversation and generates a complete backend. qwen3-coder-next has a 6.75% function calling success rate when asked to generate API data types for a shopping mall backend; AutoBe boosts that success rate to over 99.8%. It uses a harness in which type schemas constrain outputs, compilers verify results, and structured feedback compactly pinpoints where and why something went wrong so the agent can correct itself. This post dissects the engineering behind AutoBe.

AI's capability improvements haven't come from it getting less affordable (12 minute read)
AI's capability improvements at the frontier have not led to increased inference costs relative to human labor. Despite rising per-task inference costs, current models complete tasks at roughly 3% of human cost, with no upward trend in median cost ratios. Models can continue advancing even under strict cost constraints, enabling profitable automation with AI cost ratios remaining well below human levels.

The Capability Overhang in AI (4 minute read)
Coding agents outperform agents in other domains because codebases provide a self-contained environment of critical context, unlike fragmented knowledge work spread across video calls and legacy systems. Enterprise adoption remains stalled by three hard problems: context fragmentation, complex access control, and a rapidly shifting architecture landscape.

Schedule tasks on the web (5 minute read)
Claude Code on the web users can now schedule tasks. The tasks run on Anthropic-managed infrastructure, so they keep working even if users turn off their devices. Scheduled tasks are available to all Claude Code on the web users. Example tasks include reviewing open pull requests each morning, analyzing CI failures overnight and surfacing summaries, syncing documentation after PRs merge, and running dependency audits every week.

lat.md (GitHub Repo)
lat.md is a spec that agents keep in sync with the codebase to help them understand big ideas and key business logic. It ensures that corner cases that matter have proper high-level tests, and it can speed up coding by saving agents from endless grepping. The spec uses plain Markdown, with Wiki links connecting concepts into a navigable graph.

What Pretext Reinforced About AI Loops (5 minute read)
Pretext is a fast, accurate, comprehensive text measurement algorithm that can lay out web pages without leaning on DOM measurement and reflow. It was created using AI agent workflows. The particular loop used in developing the tool (constrain -> measure -> isolate -> classify -> test -> reject -> keep only what survives broad pressure) made the engineering rigorous. This article analyzes the loop to see what makes it so successful.

Things I learned at OpenAI (7 minute read)
OpenAI alumni emphasize the significance of creating effective evaluations and benchmarks, noting that the best benchmarks drive collective optimization efforts. Post-training data design and model alignment are critical for unlocking new AI capabilities, particularly for subjective attributes like empathy or creativity. Fast iteration, choosing the right problems, and leveraging internal tooling are key competitive advantages in AI research.
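The generate -> verify -> feedback loop described in the AutoBe item can be sketched in a few lines. This is a toy illustration of the pattern, not AutoBe's actual code: `call_model` is a hypothetical stand-in for a real LLM call, and the simple field-type check stands in for AutoBe's real schema and compiler verification.

```python
from typing import Callable

# Hypothetical required fields for one API data type (assumption for illustration).
SCHEMA = {"name": str, "price": float}

def validate(record: dict) -> list[str]:
    """Return structured errors pinpointing where and why validation failed."""
    errors = []
    for field, ftype in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

def harness(call_model: Callable[[str], dict], prompt: str,
            max_retries: int = 3) -> dict:
    """Constrain outputs with a schema, verify, and feed errors back for retry."""
    feedback = ""
    for _ in range(max_retries):
        candidate = call_model(prompt + feedback)
        errors = validate(candidate)
        if not errors:
            return candidate  # verified output
        # Compact, targeted feedback lets the model self-correct next round.
        feedback = "\nFix these errors: " + "; ".join(errors)
    raise RuntimeError("validation still failing after retries")
```

The point of the pattern is that the model never has to be right on the first try: precise machine-checked feedback makes each retry cheap and targeted, which is how a single-digit success rate can compound into a near-perfect one.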