Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin

2025-12-19

Summary

This paper investigates how to best train large language models (LLMs) to improve their reasoning skills, focusing on a technique called Reinforcement Learning with Verifiable Rewards (RLVR). It examines a surprising situation where both discouraging exploitation (via spurious rewards) *and* discouraging exploration (via entropy minimization) make the model better at reasoning.

What's the problem?

The researchers noticed that RLVR works well even when it uses 'spurious rewards' – rewards tied to outcomes unrelated to the ground-truth answer. It also works when the model is pushed to be very confident in its answers, which limits exploration. This seems contradictory: why would discouraging both exploration *and* exploitation lead to better results? The core issue is understanding *how* these seemingly counterproductive signals actually improve the model's reasoning ability.
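The exploration being discouraged here is usually measured as policy entropy: how spread out the model's probability distribution over next tokens is. As a minimal sketch (not code from the paper), sharper logits mean a more confident, lower-entropy policy:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats: high = exploratory, low = confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = softmax([0.0, 0.0, 0.0, 0.0])  # maximally exploratory policy
peaked  = softmax([3.0, 0.0, 0.0, 0.0])  # confident, near-deterministic policy

print(entropy(uniform))  # ln(4) ≈ 1.386
print(entropy(peaked))   # lower: probability mass concentrated on one answer
```

Pushing the model toward confident outputs drives this number down, which is what "suppressing exploration" means in practice.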

What's the solution?

The researchers found that spurious rewards actually reduce the randomness (entropy) of the model's choices, making it more focused and confident. This happens through 'clipping bias' – an asymmetry in how the clipped policy-gradient objective updates the model, which shapes what it learns even from rewards unrelated to correctness. They also showed that reducing exploration alone (entropy minimization by itself) isn't enough to improve performance; the interaction with spurious rewards is key. Finally, they propose a reward-misalignment model explaining why spurious rewards can help even beyond contaminated settings, i.e., when the model isn't simply recalling answers memorized during pretraining.
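The clipping in question comes from the PPO-style clipped surrogate objective that common RLVR pipelines build on. The sketch below is the generic textbook form, not the paper's exact formulation or analysis; it shows the asymmetry that makes clipping a biased, direction-dependent filter on updates:

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Generic PPO-style clipped objective (a standard RLVR building block;
    the paper's precise setup may differ).

    ratio: pi_new(a|s) / pi_old(a|s) for a sampled token.
    advantage: estimated advantage of that token.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)

# Asymmetry: a positive-advantage token whose probability already rose
# past 1+eps gets a capped (flat, zero-gradient) objective ...
print(clipped_surrogate(1.5, 1.0))   # 1.2 (capped at (1+eps)*A)
# ... while a negative-advantage token at the same ratio is NOT capped:
print(clipped_surrogate(1.5, -1.0))  # -1.5 (unclipped)
```

Because the `min` caps gains but not losses, repeated updates under such an objective can systematically concentrate probability mass – the kind of entropy-reducing clipping bias the paper analyzes under spurious rewards.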

Why it matters?

This work is important because it clarifies *why* RLVR is effective, even with its unusual components. Understanding these mechanisms allows researchers to design better training methods for LLMs, leading to more reliable and accurate reasoning capabilities in these powerful AI systems. It provides principles for making RLVR training more efficient and effective.

Abstract

This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.