Self-Distilled RLVR
Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan
2026-04-06
Summary
This paper investigates a technique for training large language models (LLMs) that combines the benefits of two existing methods: reinforcement learning with verifiable rewards (RLVR) and on-policy self-distillation (OPSD). It identifies a key flaw in current self-distillation approaches and proposes a new method, RLSD, to overcome it.
What's the problem?
Currently, a popular way to train LLMs involves letting a model learn from itself, acting as both the 'teacher' providing guidance and the 'student' receiving it. This self-distillation relies on giving the teacher extra privileged information, such as reference answers. However, the researchers found that this privileged information causes the teacher to unintentionally 'leak' those answers to the student, leading to unstable long-term training and capping how well the model can ultimately perform. Essentially, the student becomes reliant on the privileged information instead of learning to generate good responses on its own.
What's the solution?
The researchers propose a new method called RLSD, which stands for RLVR with Self-Distillation. Instead of relying solely on the teacher's privileged signals, RLSD uses self-distillation to measure *how much* the model's predictions differ from the teacher's at each token, and uses these token-level differences to set fine-grained update magnitudes. Simultaneously, it continues to use RLVR, which provides reliable feedback from the environment (such as checking whether an answer is correct), to determine the overall direction of each update. This way, the model benefits from detailed guidance on *how much* to adjust at each step, while remaining grounded in real-world feedback on *whether* its responses are good.
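To make the division of labor concrete, here is a minimal sketch of the idea: per-token magnitudes come from the student/teacher log-probability gap, while the sign of the update comes from the verifiable reward. The function names and the normalized-gap weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
def rlsd_token_weights(student_logprobs, teacher_logprobs):
    """Per-token update magnitudes from the student/teacher log-prob gap.

    Tokens where the privileged teacher disagrees most with the student
    receive the largest weights. (Hypothetical weighting: normalized
    absolute gaps; the paper's exact scheme may differ.)
    """
    gaps = [abs(t - s) for s, t in zip(student_logprobs, teacher_logprobs)]
    total = sum(gaps) or 1.0  # avoid division by zero when gaps are all zero
    return [g / total for g in gaps]


def rlsd_update_signal(student_logprobs, teacher_logprobs, reward):
    """Combine an RLVR direction with self-distilled per-token magnitudes.

    The verifiable outcome (e.g. answer correctness) fixes the update
    direction; self-distillation only scales it token by token.
    """
    direction = 1.0 if reward > 0 else -1.0
    weights = rlsd_token_weights(student_logprobs, teacher_logprobs)
    return [direction * w for w in weights]
```

For example, if the teacher's log-probabilities diverge from the student's only at the second token, the update concentrates entirely on that token, with its sign flipped by whether the response was verified as correct.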
Why does it matter?
This research is important because it improves the stability and performance of training large language models. By combining the strengths of two existing methods and addressing a key weakness in self-distillation, RLSD allows models to learn more effectively and achieve better results. This could lead to more capable and reliable AI systems in the future.
Abstract
On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.