
Amid a push toward AI agents, with both Anthropic and OpenAI releasing multi-agent systems this week, Anthropic is primed to showcase some of its bolder AI coding experiments. However, as is typical with AI achievement claims, there are notable caveats.
On Thursday, Anthropic researcher Nicholas Carlini published a blog post explaining how he unleashed 16 instances of the company’s Claude Opus 4.6 model on a shared codebase with minimal oversight, assigning them the task of building a C compiler from the ground up.
Across two weeks and roughly 2,000 Claude Code sessions, at an estimated $20,000 in API fees, the AI agents reportedly produced a 100,000-line, Rust-based compiler capable of building a bootable Linux 6.9 kernel for x86, ARM, and RISC-V platforms.
Carlini, a research scientist on Anthropic’s Safeguards team who previously spent seven years at Google Brain and DeepMind, employed a new Claude Opus 4.6 feature called “agent teams.” In the experiment, each Claude instance ran in its own Docker container, cloned a shared Git repository, claimed tasks by creating lock files, and pushed completed code upstream. No central orchestration agent managed traffic. Each instance autonomously picked what seemed the next most pressing problem and began working on it. When merge conflicts appeared, the model instances resolved them on their own.
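To make the lock-file idea concrete, here is a minimal sketch of how independent workers might claim tasks without a central coordinator. It is an illustration only, not Carlini's implementation: the function names (`try_claim`, `claim_next`, `release`), the `.lock` suffix, and the use of a local directory rather than a shared Git repository are all assumptions made for this example.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def try_claim(lock_dir: Path, task: str, agent: str) -> bool:
    """Try to claim `task` for `agent`; return False if it is already claimed."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{task}.lock"  # hypothetical naming scheme
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so at most one claimant can succeed for a given task.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(agent)  # record who holds the task
    return True


def release(lock_dir: Path, task: str) -> None:
    """Drop the claim so another agent can pick the task up."""
    (lock_dir / f"{task}.lock").unlink(missing_ok=True)


def claim_next(lock_dir: Path, tasks: Iterable[str], agent: str) -> Optional[str]:
    """Claim the first unclaimed task in priority order, or return None."""
    for task in tasks:
        if try_claim(lock_dir, task, agent):
            return task
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        locks = Path(tmp)
        backlog = ["parser", "codegen"]  # hypothetical task names
        print(claim_next(locks, backlog, "agent-1"))
        print(claim_next(locks, backlog, "agent-2"))
```

Because each worker simply tries the next task when a claim fails, no orchestrator is needed. The trade-off is that a crashed worker leaves a stale lock behind, which a real system would have to detect and clear.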
The compiler that resulted, which Anthropic has released on GitHub, can build a range of major open-source projects including PostgreSQL, SQLite, Redis, FFmpeg, and QEMU. It achieved a 99 percent pass rate on the GCC torture test suite and — in what Carlini called “the developer’s ultimate litmus test” — compiled and ran Doom.
Building a C compiler is an especially well-suited task for semi-autonomous AI coding: the language specification is decades old and well-defined, comprehensive test suites already exist, and there’s a known-good reference compiler for comparison. Most real-world software projects lack those advantages. Often the hardest part of development isn’t writing code that passes tests, but deciding which tests to write in the first place.
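The value of a known-good reference can be shown with a small differential-testing sketch: run the same inputs through a trusted implementation and a candidate, then report every disagreement. The two toy evaluators below are hypothetical stand-ins for a reference compiler and a compiler under test. They are not part of Anthropic's project, and the candidate's bug is planted deliberately for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def reference_eval(expr: str) -> int:
    """Trusted stand-in: evaluates 'a op b' for +, -, and *."""
    left, op, right = expr.split()
    a, b = int(left), int(right)
    return {"+": a + b, "-": a - b, "*": a * b}[op]


def candidate_eval(expr: str) -> int:
    """Stand-in under test, with a planted bug: subtraction swaps operands."""
    left, op, right = expr.split()
    a, b = int(left), int(right)
    return {"+": a + b, "-": b - a, "*": a * b}[op]


def differential_test(
    candidate: Callable[[str], int],
    reference: Callable[[str], int],
    inputs: Iterable[str],
) -> List[Tuple[str, int, int]]:
    """Return (input, expected, actual) for each input where the two disagree."""
    mismatches = []
    for item in inputs:
        expected = reference(item)
        actual = candidate(item)
        if expected != actual:
            mismatches.append((item, expected, actual))
    return mismatches


if __name__ == "__main__":
    cases = ["1 + 2", "5 - 3", "2 * 4"]
    for item, expected, actual in differential_test(candidate_eval, reference_eval, cases):
        print(f"{item}: expected {expected}, got {actual}")
```

The oracle here costs nothing to consult, which is exactly the advantage most projects lack: without a reference, someone has to decide what the expected value should be.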