Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, Jun Zhou
2025-12-29
Summary
This paper investigates how positive (correct) and negative (incorrect) training examples used to train powerful reasoning models affect how well those models learn to think through problems.
What's the problem?
Large reasoning models are improved using a technique where the model practices solving problems and is rewarded when its answers can be automatically verified as correct. This practice involves both 'positive' examples, where the model gets it right, and 'negative' examples, where it makes mistakes. It wasn't clear how these two types of examples, and how much weight to give each, affected the learning process. Specifically, researchers didn't fully understand whether positive examples just reinforced what the model already knew, or whether negative examples were crucial for discovering new ways to solve problems.
What's the solution?
The researchers carefully studied how positive and negative examples influence the model's learning. They found that positive examples refine existing correct reasoning, while negative examples push the model to explore different solution paths. Building on this, they developed a new method called A3PO that intelligently adjusts how much 'credit' or 'blame' is assigned to individual parts of the model's answer, depending on whether it came from a positive or negative example. This adjustment happens at the token level, so feedback is focused on the pieces of the answer that matter most.
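The summary does not spell out A3PO's exact shaping rule, but the general idea of asymmetric, token-level advantage shaping can be sketched as follows. The function name, the scaling factors, and the use of per-token importance weights here are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

def shaped_token_advantages(rewards, token_weights, pos_scale=1.0, neg_scale=0.8):
    """Illustrative (not A3PO's actual) asymmetric token-level advantage shaping.

    rewards: per-rollout scalar rewards (e.g. 1.0 correct, 0.0 incorrect)
    token_weights: one array of per-token importance weights per rollout
                   (e.g. derived from token entropy; hypothetical here)
    pos_scale / neg_scale: assumed separate scales for the two polarities
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-normalized sample-level advantage (a GRPO-style baseline).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    shaped = []
    for a, w in zip(adv, token_weights):
        w = np.asarray(w, dtype=float)
        # Apply a different scale depending on the sample's polarity.
        scale = pos_scale if a > 0 else neg_scale
        # Concentrate the sample's advantage on high-weight tokens while
        # keeping the mean per-token credit equal to the scaled advantage.
        alloc = scale * a * w * len(w) / (w.sum() + 1e-8)
        shaped.append(alloc)
    return shaped
```

With four rollouts (two correct, two incorrect) and uniform token weights, positive rollouts receive positive per-token advantages at full scale, while negative rollouts receive down-weighted negative ones, reflecting the asymmetric treatment of the two polarities.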
Why it matters?
This work is important because it provides a better understanding of how to train these complex reasoning models. By carefully balancing positive and negative feedback, and by focusing that feedback on the most important parts of the answer, we can build models that are not only more accurate but also better at tackling new and challenging problems. This could lead to improvements in areas like question answering, mathematical problem solving, and other applications of reasoning models.
Abstract
Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.