The Path Not Taken: RLVR Provably Learns Off the Principals

Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai

2025-11-12

Summary

This research investigates why Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning of large language models while appearing to change only a small fraction of their parameters. It turns out these changes aren't random: they follow a specific pattern determined by the pretrained model itself.

What's the problem?

RLVR is effective at making language models better at reasoning, yet it only seems to adjust a small fraction of a model's parameters. This is puzzling because you'd expect significant improvements to require more widespread changes. The researchers wanted to understand *why* RLVR works so well despite this apparent sparsity, and whether there is a hidden order to the changes it *does* make.

What's the solution?

The researchers discovered that RLVR doesn't randomly tweak parameters; for a given pretrained model, it consistently updates the same preferred regions of weight space. They developed a 'Three-Gate Theory' to explain this. The first gate (a KL anchor) limits how far the model can move with each update. The second gate (the model's geometry) steers updates away from the principal directions, the directions that most strongly shape the model's behavior, and into low-curvature regions that leave its core function intact. The third gate (numerical precision) hides micro-updates in non-preferred regions, making it *look* as if very few parameters change at all. They confirmed this by tracking which parameters are updated during RLVR training and comparing the pattern to a more standard training process, Supervised Fine-Tuning (SFT).
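The third gate is easy to see in miniature. Below is a minimal sketch of how low-precision weight storage can swallow tiny updates, using NumPy's float16 as a stand-in for the low-precision formats discussed in the paper; the update magnitudes are illustrative, not taken from the paper.

```python
import numpy as np

# Gate III (Precision), illustrated: when weights are stored at low precision,
# an update smaller than the representable step vanishes on write-back, so the
# parameter looks untouched -- apparent sparsity without true sparsity.
w = np.float16(1.0)        # a weight stored at 16-bit precision
micro = np.float16(1e-4)   # a tiny off-principal micro-update
large = np.float16(1e-2)   # a larger update in a "preferred" region

w_micro = w + micro        # float16 spacing near 1.0 is ~9.8e-4,
hidden = (w_micro == w)    # so the micro-update rounds away entirely

w_large = w + large        # the larger update survives rounding
visible = (w_large != w)

print(hidden, visible)     # True True
```

Applied across millions of parameters, the same rounding would make the visible update mask look sparse even though the optimizer nudged nearly every weight.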

Why it matters?

This work is important because it provides a deeper understanding of how RLVR actually works, moving beyond just knowing *that* it works. It shows that RLVR operates differently than traditional training methods, meaning that techniques designed for standard training might not be the best way to improve RLVR. This research paves the way for designing new, more effective learning algorithms specifically tailored for RLVR, potentially leading to even more powerful language models.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
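The abstract's parameter-level claims (off-principal updates, minimal spectral drift) suggest a simple SVD-based diagnostic one could run on any weight matrix. The sketch below uses a synthetic matrix and a deliberately off-principal update, so the numbers are illustrative of the quantities being measured, not results reproduced from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # stand-in for a weight matrix
U, S, Vt = np.linalg.svd(W)
k = 8                               # size of the principal subspace

# Construct an update confined to the trailing (non-principal) directions,
# mimicking the off-principal updates the paper attributes to RLVR.
M = rng.standard_normal((64 - k, 64 - k))
dW = 0.05 * U[:, k:] @ M @ Vt[k:, :]

# Fraction of the update's energy landing in the top-k principal subspace.
frac = (np.linalg.norm(U[:, :k].T @ dW @ Vt[:k, :].T) / np.linalg.norm(dW)) ** 2

# Spectral drift: relative change in the largest singular value after the update.
S_new = np.linalg.svd(W + dW, compute_uv=False)
drift = abs(S_new[0] - S[0]) / S[0]

print(f"principal-subspace energy: {frac:.2e}, top singular-value drift: {drift:.2e}")
```

An update built this way leaves the principal subspace and the leading singular values essentially untouched, which is the signature the paper reports for RLVR; an SFT-style update concentrated on `U[:, :k]` and `Vt[:k, :]` would instead show large values for both quantities.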