Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel
2025-08-07
Summary
This paper describes how to train AI agents built on Large Language Models (LLMs) to handle software engineering tasks that require many steps and back-and-forth interaction with an environment, rather than one-shot problems like solving a math question or generating a single snippet of code.
What's the problem?
Most existing applications of reinforcement learning to LLMs focus on single-turn tasks, where the model receives no real feedback between steps. Real-world software engineering is different: an agent must interact with its environment over many turns, with each action changing the environment's state and producing a meaningful observation in response, which makes training much harder.
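To make the contrast concrete, here is a toy sketch of the multi-turn loop the paper is concerned with: a stateful environment that returns a non-trivial observation after every action, which the policy then conditions on. The names (`ShellEnv`, `toy_policy`) and the trivial bug-fixing scenario are illustrative inventions, not the paper's actual environment or agent.

```python
from dataclasses import dataclass, field

@dataclass
class ShellEnv:
    """Toy stateful environment: each action mutates state and returns an
    observation, unlike single-turn generation where no feedback arrives."""
    files: dict = field(default_factory=lambda: {"bug.py": "x = 1/0"})
    done: bool = False

    def step(self, action: str) -> str:
        if action.startswith("read "):
            return self.files.get(action[5:], "<no such file>")
        if action.startswith("write "):
            _, name, content = action.split(" ", 2)
            self.files[name] = content
            return "ok"
        if action == "run tests":
            # "Tests pass" once the division-by-zero bug is gone.
            self.done = "1/0" not in self.files["bug.py"]
            return "PASS" if self.done else "FAIL: ZeroDivisionError"
        return "unknown command"

def toy_policy(observation: str) -> str:
    # Stand-in for the LLM: maps the latest observation to the next action.
    if observation == "start":
        return "read bug.py"
    if "1/0" in observation:
        return "write bug.py x = 1"
    return "run tests"

def rollout(env: ShellEnv, max_turns: int = 8):
    """Collect one multi-turn trajectory of (action, observation) pairs."""
    trajectory, obs = [], "start"
    for _ in range(max_turns):
        action = toy_policy(obs)
        obs = env.step(action)
        trajectory.append((action, obs))
        if env.done:
            break
    return trajectory
```

The trajectory produced by `rollout` is exactly the object that multi-turn RL must assign credit over, whereas single-turn RL sees only one action and one terminal reward.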
What's the solution?
The researchers applied a modified training method called Decoupled Advantage Policy Optimization (DAPO) to a large open model, Qwen2.5-72B-Instruct. The trained agent learned to solve multi-step software engineering problems far more reliably, raising its success rate on SWE-bench Verified from 20% (a rejection fine-tuned baseline) to 39%, and matching or outperforming other leading open-weight models without any help from teacher models.
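A core ingredient of GRPO-family algorithms such as DAPO is a critic-free, group-relative advantage estimate: sample several rollouts per task, score each with a scalar reward (e.g. whether the agent's patch passes the tests), and normalize rewards within the group. The sketch below illustrates that idea only; it is a simplification under my assumptions, not the paper's modified DAPO implementation.

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: standardize rewards within one group of
    rollouts for the same task, so no learned value function is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards: list[float]) -> bool:
    """DAPO-style dynamic sampling (simplified): a group where every rollout
    succeeds or every rollout fails yields zero advantage everywhere, so it
    is filtered out to keep the gradient signal informative."""
    return len(set(rewards)) > 1
```

With binary pass/fail rewards like `[1, 0, 0, 1]`, successful rollouts get positive advantages and failed ones negative, while uniform groups like `[0, 0, 0, 0]` are discarded.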
Why it matters?
This matters because it demonstrates a way to build AI agents that can handle complex, real-world tasks involving long sequences of actions and feedback, such as software development. It points toward more capable and reliable AI tools, built on open and accessible models, that can assist programmers and tackle difficult problems on their own.
Abstract
Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.