
SWE-RM: Execution-free Feedback For Software Engineering Agents

KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He

2025-12-29

Summary

This paper focuses on improving how we train AI agents to write code, specifically by giving them better feedback. It explores different ways to evaluate code, moving beyond just whether it passes tests to understanding *how* good the code is.

What's the problem?

Currently, AI coding agents are improved in two main ways: test-time scaling (TTS), where the agent generates several candidate solutions and a verifier picks the best one, and reinforcement learning (RL), where the agent learns from reward signals. Unit tests are the usual source of feedback, but they are a blunt instrument: you only learn whether the code passes, not *why* it fails or how close it came, and collecting reliable tests at scale is hard. There is also a subtler issue: a verifier that is good at picking the best solution from a set (which is what TTS measures) does not necessarily provide feedback an agent can learn from during RL. In fact, two verifiers that look equally good at TTS can produce very different results when used as reward signals for RL.
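To make the TTS setup concrete, here is a minimal sketch (not the paper's actual pipeline; the scoring functions are toy stand-ins) of best-of-N selection, contrasting sparse pass/fail feedback from tests with the continuous scores a reward model can give:

```python
from typing import Callable

def best_of_n(candidates: list[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the candidate patch with the highest score."""
    return max(candidates, key=score)

def unit_test_score(patch: str) -> float:
    """Execution-based feedback: binary pass/fail (toy stand-in for a test runner)."""
    return 1.0 if "fix" in patch else 0.0

def reward_model_score(patch: str) -> float:
    """Execution-free feedback: a continuous quality score (toy stand-in for a
    learned reward model), so patches that all pass tests can still be ranked."""
    return len(patch) / 100.0

candidates = [
    "fix: handle empty input",
    "fix: handle empty input and add a regression test",
    "noop",
]

# The two "fix" patches tie under the binary signal,
# but the reward-model score can still distinguish them.
print(best_of_n(candidates, unit_test_score))
print(best_of_n(candidates, reward_model_score))
```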

What's the solution?

The researchers found that, for reinforcement learning, a good feedback system needs more than the ability to pick the best solution: it must also be accurate at classifying whether a given solution is correct, and *calibrated*, meaning its confidence scores reliably reflect how likely its judgments are to be right. They built SWE-RM, a reward model dedicated to evaluating code written by agents. SWE-RM uses a mixture-of-experts architecture, combining many smaller expert networks (30B parameters in total, with only 3B active at inference time). They trained it through controlled experiments on training data scale, policy mixtures, and data source composition, aiming to make it both accurate and well-calibrated.
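To make "calibration" concrete, here is a minimal sketch (not from the paper) of expected calibration error (ECE), a standard way to check whether a model's predicted confidences match how often it is actually right:

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray,
                               labels: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE: the average gap between predicted confidence and observed accuracy,
    weighted by how many predictions fall into each confidence bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        accuracy = labels[mask].mean()          # fraction actually correct in this bin
        confidence = confidences[mask].mean()   # average predicted confidence in this bin
        ece += mask.mean() * abs(accuracy - confidence)
    return ece

# Toy example: reward-model scores (treated as probabilities that a patch is
# correct) versus ground-truth outcomes.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.2])
truth  = np.array([1,   1,   0,   1,   0,   0  ])
print(expected_calibration_error(scores, truth))
```

A well-calibrated reward model keeps this gap small: when it assigns a score of 0.8, roughly 80% of those solutions should really be correct.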

Why it matters?

SWE-RM substantially improves AI coding agents on challenging benchmarks. Used for test-time scaling, it raises Qwen3-Coder-Flash from 51.6% to 62.0% and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified, a new state of the art among open-source models. Better feedback signals are a key ingredient for building more capable and reliable coding agents, and this work offers a concrete recipe for training them.

Abstract

Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. Aiming to develop versatile reward models that are effective across TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.