SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang
2025-02-26
Summary
This paper introduces SWE-RL, a method that uses reinforcement learning to improve how large language models reason about and solve real-world software engineering tasks.
What's the problem?
AI models are getting better at solving coding challenges, but most of that progress has focused on simple, isolated problems like those in competitive programming. These methods don't transfer well to real-world software engineering, where tasks are more complex and require understanding a project's entire history, including its code changes, bug fixes, and discussions.
What's the solution?
The researchers developed SWE-RL, which trains AI models on the evolution of open-source software projects, such as GitHub pull requests and issue reports. A lightweight rule-based reward teaches the model to reason through problems by comparing its generated fixes to the real developer fixes, so the model learns from how developers actually solve problems over time. The resulting model, Llama3-SWE-RL-70B, shows significant improvements on real-world coding issues and even develops general reasoning skills that carry over to tasks outside software engineering.
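The core of the reward described above is a textual similarity score between the model's proposed fix and the ground-truth fix. Here is a minimal sketch of such a rule-based reward using Python's standard `difflib`; the function names and the -1 penalty for empty output are illustrative assumptions, not the paper's exact implementation.

```python
import difflib


def patch_similarity(pred_patch: str, oracle_patch: str) -> float:
    """Similarity ratio in [0, 1] between a predicted and a ground-truth patch."""
    return difflib.SequenceMatcher(None, pred_patch, oracle_patch).ratio()


def rule_based_reward(pred_patch: str, oracle_patch: str) -> float:
    """Hypothetical reward: penalize malformed (empty) output, else score by similarity."""
    if not pred_patch.strip():
        return -1.0  # assumed penalty for unusable output
    return patch_similarity(pred_patch, oracle_patch)
```

Because the reward needs no test execution or sandboxing, it scales cheaply to millions of pull requests, which is what makes RL on this volume of software evolution data practical.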
Why it matters?
This matters because it pushes AI closer to being useful in real-world software development, not just in controlled benchmark settings. By training on real project data, SWE-RL helps create AI tools that can assist developers with debugging, bug fixing, and other complex tasks. It also shows that reinforcement learning on software engineering data can improve the general reasoning abilities of large language models.
Abstract
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.