SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, Gabriel Maduekwe

2025-12-22

Summary

This paper introduces SWE-Bench++, a new way to test how well large language models (LLMs) can handle real-world software engineering tasks, like fixing bugs and adding features to existing projects.

What's the problem?

Current benchmarks for evaluating LLMs on coding tasks are limited because they are often created by hand, use unchanging datasets, and mostly focus on Python. This means they don't accurately reflect the variety of challenges developers face when working with different programming languages and evolving projects.

What's the solution?

The researchers created SWE-Bench++, which automatically generates coding challenges from real pull requests on GitHub. For each pull request, it sets up the environment needed to run the code, extracts the tests that check whether an LLM's solution is correct, and filters out low-quality tasks. It also turns instances that even strong models fail on into example solutions that can be used to train models. The initial version includes over 11,000 tasks from nearly 4,000 repositories in 11 different programming languages.
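The pipeline described above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the stage structure (sourcing, oracle extraction, quality assurance), not the paper's actual implementation; all class and function names here are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    repo: str
    code_diff: str   # changes to source files
    test_diff: str   # changes to test files (empty if the PR touched no tests)

@dataclass
class TaskInstance:
    repo: str
    problem_patch: str
    oracle_tests: str

def source_prs(prs):
    # Stage 1 (programmatic sourcing): keep only PRs that modify both
    # source code and tests, so a test oracle can be extracted later.
    return [pr for pr in prs if pr.code_diff and pr.test_diff]

def extract_oracle(pr):
    # Stage 3 (test oracle extraction): the tests changed by the PR
    # become the pass/fail check for a candidate solution.
    return TaskInstance(pr.repo, pr.code_diff, pr.test_diff)

def quality_check(task):
    # Stage 4 (quality assurance): discard tasks with an empty oracle.
    return bool(task.oracle_tests.strip())

def build_benchmark(prs):
    tasks = [extract_oracle(pr) for pr in source_prs(prs)]
    return [t for t in tasks if quality_check(t)]

prs = [
    PullRequest("org/app", "fix null check in parser", "add test_parse_null"),
    PullRequest("org/app", "update README", ""),  # no tests -> filtered out
]
print(len(build_benchmark(prs)))  # 1
```

Stage 2 (environment synthesis) is omitted here since it involves building per-repository execution environments, which doesn't reduce to a few lines.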

Why it matters?

SWE-Bench++ is important because it provides a more realistic and scalable way to evaluate and improve LLMs for software development. By using real-world code changes and supporting multiple languages, it helps push the boundaries of what these models can do and ultimately assists developers in their work.

Abstract

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.
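The abstract reports results as pass@10. For context, pass@k is commonly estimated with the unbiased estimator introduced in the Codex paper (Chen et al., 2021): given n sampled solutions of which c pass, pass@k is the probability that at least one of k randomly drawn samples passes. The sketch below assumes that standard estimator; the paper may compute pass@10 differently.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e. the chance that at least one of k draws from n samples
    (c of them correct) is a passing solution."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 2 correct, k = 5
print(round(pass_at_k(10, 2, 5), 4))  # 0.7778
```

Note that when k equals the number of samples n (as in pass@10 with 10 samples), the estimator is simply 1 if any sample passed and 0 otherwise.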