Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

2026-04-07

Summary

This paper focuses on improving how we fine-tune large language models using a technique called reinforcement learning with verifiable rewards, specifically addressing issues with existing methods like GRPO and SDPO.

What's the problem?

Currently, two main approaches are used to improve these models. GRPO works well overall, but its feedback is coarse: it rewards or penalizes a whole response without pinpointing *exactly* which tokens went wrong. SDPO is more precise, offering detailed token-level feedback, but it often becomes unstable and its performance collapses over longer training runs. This instability stems from two issues: SDPO keeps "correcting" responses that are already correct, which gives the model ambiguous optimization targets, and the 'teacher' signal it distills from gradually becomes less reliable as training goes on.

What's the solution?

The researchers introduce a new method called SRPO, which cleverly combines the strengths of both GRPO and SDPO. SRPO intelligently routes different types of examples to different learning processes. Correct responses are handled by GRPO for overall reward-focused improvement, while incorrect responses are sent to SDPO for detailed, token-level correction. Additionally, SRPO includes a system to prioritize learning from the most reliable 'teacher' signals, ignoring those that are uncertain or noisy, using a weighting system based on how confident the model is.
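The routing rule and the entropy-based weighting described above can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the rollout fields, the exponential weighting form `exp(-alpha * H)`, and the `alpha` parameter are assumptions made for the example, since the summary only states that high-entropy teacher targets are down-weighted.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_samples(rollouts):
    """SRPO-style routing (sketch): correct rollouts go to the GRPO
    reward-based update, incorrect ones to SDPO's token-level distillation.
    Each rollout is assumed to be a dict with a boolean 'correct' field."""
    grpo_batch = [r for r in rollouts if r["correct"]]
    sdpo_batch = [r for r in rollouts if not r["correct"]]
    return grpo_batch, sdpo_batch

def distillation_weights(teacher_token_probs, alpha=1.0):
    """Entropy-aware weighting (assumed form): confident, low-entropy
    teacher targets get weights near 1; uncertain, high-entropy targets
    are suppressed toward 0."""
    return [math.exp(-alpha * token_entropy(p)) for p in teacher_token_probs]
```

For example, a near-deterministic teacher distribution like `[0.99, 0.01]` receives a higher weight than a maximally uncertain `[0.5, 0.5]`, so the distillation loss is dominated by the targets the self-teacher is confident about.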

Why it matters?

SRPO represents a significant advancement in fine-tuning large language models because it achieves both the fast initial improvement of SDPO *and* the long-term stability of GRPO, a combination previous methods struggled to deliver. It consistently outperforms both baselines across several benchmarks, yielding better overall performance and more reasonable response lengths, while also cutting per-step training compute by up to 17.2%.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.