SWE-bench Goes Live!
Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
2025-05-30
Summary
This paper introduces SWE-bench-Live, a continuously updated benchmark that tests how well AI models can resolve real issues posted on GitHub, a popular platform for hosting coding projects.
What's the problem?
Most benchmarks for testing AI on programming tasks rely on old, static data, which doesn't reflect the latest challenges developers face. Worse, benchmarks can become 'contaminated' when models have already seen the answers during training, which inflates their scores.
What's the solution?
The researchers built SWE-bench-Live to automatically pull in new, real GitHub issues and package them so that AI models can be tested fairly and at scale, while ensuring the models haven't already seen the answers during training.
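One way to picture the contamination-resistance idea is date filtering: only issues filed after a model's training cutoff can be safely used to evaluate it. The sketch below is purely illustrative; the `Task` fields, dates, and filtering helper are assumptions for this example, not the paper's actual pipeline or data format.

```python
# Hypothetical sketch of contamination-safe task selection.
# Field names and values are illustrative assumptions, not the
# benchmark's real schema.
from dataclasses import dataclass

@dataclass
class Task:
    repo: str          # e.g. "owner/project"
    issue_id: int      # GitHub issue number
    created_at: str    # ISO date; compares correctly as a string

def is_contamination_safe(task: Task, model_cutoff: str) -> bool:
    # Keep only issues filed AFTER the model's training cutoff,
    # so the model cannot have seen the fix during training.
    return task.created_at > model_cutoff

tasks = [
    Task("octocat/hello", 101, "2024-01-15"),
    Task("octocat/hello", 102, "2025-03-02"),
]
fresh = [t.issue_id for t in tasks if is_contamination_safe(t, "2024-10-01")]
print(fresh)  # → [102]
```

Because ISO-8601 date strings sort lexicographically in chronological order, a plain string comparison suffices for the cutoff check.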
Why does it matter?
This matters because it lets developers and researchers see how well AI models actually perform on up-to-date, real-world programming problems, making AI coding tools more trustworthy and useful for the software industry.
Abstract
SWE-bench-Live is a continuously updatable benchmark for evaluating LLMs in issue resolution, featuring live GitHub issues and automated curation to ensure scalability and contamination resistance.