
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel

2025-05-29


Summary

This paper introduces SWE-rebench, an automated system that gathers real software engineering tasks from GitHub to test and evaluate AI agents designed for software engineering.

What's the problem?

Current ways of testing AI models for software engineering often rely on small, static sets of tasks that may be artificial, repetitive, or already present in the models' training data. Because of this, benchmark scores don't reliably show how well an AI would perform on genuinely new, real-world programming challenges, making it hard to know how capable these models actually are.

What's the solution?

To solve this, the researchers built an automated pipeline that continuously collects real, interactive coding tasks from GitHub and uses them to create a more realistic and challenging benchmark called SWE-rebench. The benchmark also excludes tasks that a model might have already seen during training, making the evaluation fairer and more accurate.
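To make the decontamination idea concrete, here is a minimal sketch of one common approach: keeping only tasks created after a model's training-data cutoff. The task fields, cutoff date, and function name below are invented for illustration; the paper's actual pipeline is more involved.

```python
from datetime import datetime, timezone

# Hypothetical training-data cutoff for the model being evaluated.
MODEL_CUTOFF = datetime(2024, 10, 1, tzinfo=timezone.utc)

# Toy task records; real tasks would come from GitHub issues/PRs.
tasks = [
    {"repo": "example/repo-a", "issue": 101,
     "created_at": datetime(2024, 6, 15, tzinfo=timezone.utc)},
    {"repo": "example/repo-b", "issue": 202,
     "created_at": datetime(2025, 1, 20, tzinfo=timezone.utc)},
]

def decontaminate(tasks, cutoff):
    """Keep only tasks created strictly after the training cutoff,
    so the model cannot have seen them during training."""
    return [t for t in tasks if t["created_at"] > cutoff]

fresh = decontaminate(tasks, MODEL_CUTOFF)
print([t["issue"] for t in fresh])  # → [202]
```

Filtering by creation date is only one signal; a production pipeline would typically also check for near-duplicate code or issue text.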

Why it matters?

This matters because it gives researchers and developers a more trustworthy way to measure and improve AI tools for programming, which could lead to more capable software engineering assistants and better code in the future.

Abstract

A novel pipeline extracts real-world, interactive software engineering tasks from GitHub to create SWE-rebench, improving the evaluation of reinforcement learning models in SWE.