SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?
Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan
2025-11-11
Summary
This paper introduces a new way to test how well computer programs can automatically improve the speed of existing software projects, focusing on real-world code rather than simple problems.
What's the problem?
Most existing benchmarks for automated software engineering focus on *finding* problems, not on *fixing* performance issues. Improving software speed requires understanding how code actually executes and pinpointing exactly where it is slow, which is difficult for automated tools. Existing benchmarks don't really evaluate this 'how-to-fix' skill, leaving a gap in assessing automated performance optimization tools.
What's the solution?
The researchers created a benchmark called SWE-fficiency. It gives programs a complete software project and a slow workload, and challenges them to make the code run faster. They built an automated pipeline that finds real-world performance improvements made by human developers on GitHub, using the speedups those experts achieved as targets for the automated programs to match or exceed. The system verifies that an automated change actually makes the workload faster and doesn't break the existing unit tests.
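The paper's exact acceptance logic isn't spelled out in this summary, but the core check it describes, namely whether a candidate patch reaches the expert's speedup while the relevant tests still pass, can be sketched roughly as follows (all function names here are illustrative, not from the benchmark):

```python
import time

def measure_runtime(workload, repeats=3):
    """Best-of-N wall-clock runtime of a zero-argument workload callable."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return min(times)

def speedup(baseline_s, patched_s):
    """Speedup of the patched code relative to the baseline (higher is better)."""
    return baseline_s / patched_s

def patch_accepted(baseline_s, patched_s, expert_s, tests_pass):
    """A candidate patch matches the expert if it is at least as fast as the
    expert's version AND the relevant repository unit tests still pass."""
    return tests_pass and speedup(baseline_s, patched_s) >= speedup(baseline_s, expert_s)
```

For example, with a 10 s baseline, a 4 s expert version, and a 3 s candidate patch, the candidate's ~3.3x speedup beats the expert's 2.5x, so the patch is accepted as long as the tests pass.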
Why it matters?
This work is important because it provides a realistic and challenging test for automated software optimization. It shows that current automated tools recover only a small fraction of the speedups human experts achieve (under 0.15x on average), highlighting where more research is needed before programs can reliably speed up software projects without introducing errors. The benchmark and data released with the paper will help other researchers develop and test better automated performance engineering tools.
Abstract
Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, agents achieve less than 0.15x the expert speedup: agents struggle in localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.
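The abstract's headline number, agents averaging under 0.15x the expert speedup, suggests a per-task score normalized by the expert's speedup and then aggregated across the suite. A minimal sketch of one plausible reading (the averaging scheme and names are my assumption, not taken from the paper):

```python
def normalized_speedup(agent_speedup, expert_speedup):
    """Fraction of the expert's speedup the agent achieved on one task."""
    return agent_speedup / expert_speedup

def benchmark_score(per_task):
    """Mean normalized speedup over (agent_speedup, expert_speedup) pairs,
    one pair per task in the suite."""
    return sum(normalized_speedup(a, e) for a, e in per_task) / len(per_task)
```

Under this reading, an agent that achieves a 1.0x speedup (no improvement) on a task where the expert achieved 2.0x scores 0.5 on that task, and a suite-wide score below 0.15 means agents typically recover only a small slice of the expert improvement.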