SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Minh V. T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui
2025-12-25
Summary
This paper introduces a new way to test AI coding assistants, moving beyond simple tasks to focus on how well they can handle realistic software development projects that require many changes over time.
What's the problem?
Current tests for AI coding agents only check if they can fix a single bug or add a small feature. Real software development is much more complex, involving understanding what a project *should* do, planning out changes across many different code files, and making sure new code doesn't break existing features. Existing benchmarks don't accurately reflect these challenges, so it's hard to know if AI can actually help with real-world coding.
What's the solution?
The researchers created a benchmark called SWE-EVO, built from the release notes and version histories of seven mature, popular open-source Python projects. The benchmark comprises 48 evolution tasks, each requiring the AI to coordinate changes across many files at once – an average of 21 files per task – and then verifies the changes against a large test suite (around 874 tests per task). They then evaluated some of the most advanced AI models, including GPT-5, on this benchmark.
Why it matters?
The results showed that even the best AI models struggle with these more complex, long-horizon coding tasks. GPT-5 solved only 21% of SWE-EVO tasks, compared to 65% on the simpler, single-issue SWE-Bench Verified. This highlights a significant gap in the capabilities of current AI coding assistants and shows that more research is needed to make them truly useful for real software development. The paper also introduces Fix Rate, a metric that measures *how much* progress an AI makes on a complex task, even if it doesn't fully solve it.
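The summary does not spell out how Fix Rate is computed. As a minimal sketch, assuming it is simply the fraction of an instance's validation tests that pass after the agent's changes (a plausible but unconfirmed reading), it might look like this:

```python
# Hypothetical sketch of a test-based partial-progress metric.
# NOTE: the paper's exact Fix Rate definition is not given in this
# summary; we assume here it is the fraction of a task's validation
# tests that pass after the agent's edits. The function and test
# names below are illustrative, not from the benchmark.

def fix_rate(test_results: dict[str, bool]) -> float:
    """Fraction of tests passing, in [0, 1]; 1.0 would mean fully resolved."""
    if not test_results:
        return 0.0
    return sum(test_results.values()) / len(test_results)

# Example: an agent's patch passes 3 of 4 tests on one instance.
results = {"test_parse": True, "test_merge": False,
           "test_cli": True, "test_io": True}
print(f"{fix_rate(results):.2f}")  # prints 0.75
```

Unlike a binary resolved/unresolved score, such a metric would credit an agent that fixes most of a multi-file task, which matters when full resolution rates are as low as 21%.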
Abstract
Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.