SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Frederic Sala, Aws Albarghouthi
2026-03-28
Summary
This research investigates how well AI coding agents can actually *build* software, not just pass tests. It points out that current ways of testing AI code focus too much on getting a single correct answer and don't check how easily the code can be improved or expanded later on.
What's the problem?
Existing benchmarks for AI coding usually give the AI a complete problem description and expect a correct solution in one shot. This doesn't reflect how real software development works, which is an iterative process of building, testing, and refining. The problem is that code that passes tests can still be poorly structured and difficult to modify as requirements change. Existing iterative benchmarks constrain the agent's design decisions so tightly that they can't measure how code quality shapes future development.
What's the solution?
The researchers created a new benchmark called SlopCodeBench. This benchmark gives AI agents problems and then asks them to repeatedly extend their *own* code as the problem description evolves. The AI isn't given a complete specification upfront; instead, it has to make design choices and adapt its code over time. The researchers then tracked two signals of code quality: verbosity, the fraction of redundant or duplicated code, and structural erosion, how much of the code's complexity piles up in a few highly complex functions.
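To make the verbosity idea concrete, here is a minimal sketch of one plausible redundancy measure: the fraction of non-trivial lines that exactly repeat an earlier line. This is an illustrative stand-in, not the paper's actual metric or implementation.

```python
from collections import Counter

def verbosity(source: str) -> float:
    """Fraction of non-trivial lines that duplicate an earlier line,
    after stripping whitespace. A rough illustration of a redundancy
    measure, not the benchmark's real definition."""
    lines = [ln.strip() for ln in source.splitlines()]
    lines = [ln for ln in lines if len(ln) > 3]  # skip trivial lines like "}" or "pass"
    counts = Counter(lines)
    duplicated = sum(c - 1 for c in counts.values())  # every repeat beyond the first
    return duplicated / len(lines) if lines else 0.0

code = "x = load()\ny = clean(x)\nx = load()\ny = clean(x)\nreport(y)\n"
print(verbosity(code))  # 0.4: two of the five lines repeat earlier ones
```

Real duplication detectors work on token streams or syntax trees rather than raw lines, but even this crude version captures the intuition: copy-pasted logic pushes the score up.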
Why does it matter?
The results show that AI agents struggle to maintain good code quality over multiple iterations. Their code gets messier and more complex with each iteration, while comparable human-written code stays flat; against open-source Python repositories, agent code was 2.2x more verbose. This means that simply achieving a passing score on a test doesn't guarantee the AI has created robust, maintainable software. It highlights the need for better ways to evaluate AI coding abilities and for AI agents to learn better software design principles.
Abstract
Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.
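The abstract defines structural erosion as the share of complexity mass concentrated in high-complexity functions. A minimal sketch of that idea, assuming per-function cyclomatic complexities are already available and using an illustrative threshold of 10 (the paper's exact threshold and complexity source are not given here):

```python
def structural_erosion(complexities, threshold=10):
    """Share of total complexity 'mass' carried by functions whose
    cyclomatic complexity exceeds a threshold. The threshold and the
    input values are illustrative assumptions, not the paper's setup."""
    total = sum(complexities)
    heavy = sum(c for c in complexities if c > threshold)
    return heavy / total if total else 0.0

early = [2, 3, 4, 2, 3]    # small, evenly sized functions: no hotspots
late = [2, 3, 25, 2, 18]   # after iteration, complexity piles into two functions
print(structural_erosion(early))  # 0.0: no function exceeds the threshold
print(structural_erosion(late))   # 0.86: most complexity sits in two hotspots
```

Under this reading, erosion rising over a trajectory means new complexity is accumulating in a few already-complex functions rather than being spread across well-factored units.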