BeyondSWE: Can Current Code Agents Survive Beyond Single-Repo Bug Fixing?
Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
2026-03-04
Summary
This paper introduces a new way to test how well AI can write and fix code, going beyond simple tasks to focus on more realistic challenges developers face every day.
What's the problem?
Current tests for AI coding tools only check whether they can fix small problems within a single project. This doesn't reflect what developers actually do, which involves working across multiple projects, applying domain-specific expertise, managing how different pieces of code depend on each other, and even creating entire projects from scratch. As a result, existing tests weren't measuring the full capabilities, or the limitations, of these AI tools.
What's the solution?
The researchers created a benchmark called BeyondSWE with 500 real-world coding problems that demand more complex skills, such as working across multiple projects and applying specialized knowledge. They also built a system called SearchSWE that lets an AI use search engines to find information while coding, mimicking how human developers work. They then tested several advanced AI models on these new challenges to see how they performed.
Why it matters?
This work is important because it shows that even the best AI coding tools still have a long way to go before they can truly help developers with complex tasks. It also highlights that simply giving an AI access to search isn't enough: the AI needs to effectively combine searching for information with actual coding and problem-solving skills. The new benchmark and testing framework will help researchers build and evaluate more capable AI coding assistants in the future.
Abstract
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes (resolution scope and knowledge scope) using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.