Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin

2026-02-13

Summary

This paper explores a technique called on-policy distillation, which trains a smaller 'student' AI model to mimic a larger, more capable 'teacher' model. It builds on existing methods and proposes improvements that make the student perform better, in some cases even surpassing the teacher's abilities.

What's the problem?

On-policy distillation currently works well in practice, but the reasons *why* it works so well aren't fully understood. Standard on-policy distillation also has limits on how much it can leverage the teacher's knowledge, especially when combining knowledge from multiple specialized teachers or when the teacher is much larger than the student. In particular, the original objective fixes the balance between the reward term and the KL regularization, so there is no way to push the student beyond simply matching the teacher.

What's the solution?

The researchers first showed that on-policy distillation is a special case of KL-constrained reinforcement learning in which the reward term and the KL regularization are always weighted equally. Building on this view, they created a more flexible framework called Generalized On-Policy Distillation (G-OPD), which adds a reward scaling factor controlling how much weight the reward term gets relative to the KL regularization, together with a freely chosen reference model. Setting the scaling factor above 1, which they call reward extrapolation (ExOPD), consistently improved performance over standard OPD. They also found that when the teacher is much larger than the student, using the teacher's base model (its version before reinforcement learning) as the reference model gives a more accurate reward signal and further improves distillation. A sketch of the resulting objective is given below.
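To make this concrete, one way to read the G-OPD objective described above is the following; the symbols here (β for the reward scaling factor, π_ref for the reference model, π_teacher for the teacher) are my own illustrative notation rather than the paper's exact formulation.

$$
\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_\theta}\left[\sum_{t} \beta\,\big(\log \pi_{\text{teacher}}(y_t \mid y_{<t}) - \log \pi_{\text{ref}}(y_t \mid y_{<t})\big) \;-\; \mathrm{KL}\big(\pi_\theta(\cdot \mid y_{<t})\,\|\,\pi_{\text{ref}}(\cdot \mid y_{<t})\big)\right]
$$

With β = 1 the reference model cancels and the objective reduces to minimizing the reverse KL from the student to the teacher on student-generated trajectories, i.e., standard OPD; β > 1 is the reward-extrapolation (ExOPD) setting.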

Why it matters?

This work provides a better theoretical understanding of why on-policy distillation is effective and offers practical improvements that can lead to better AI models. The ability to surpass the teacher's performance is particularly exciting, as it opens up possibilities for creating AI systems that combine expertise from different sources. These improvements are valuable for complex tasks like solving math problems and generating computer code, where having a strong AI assistant is crucial.

Abstract

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
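For readers who prefer code, here is a minimal PyTorch-style sketch of what a per-token G-OPD-style loss could look like, assuming the objective can be implemented as a reverse KL from the student to an extrapolated teacher distribution built from teacher and reference log-probabilities. The function name, the `beta` argument, and the exact form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def g_opd_loss(student_logits, teacher_logits, ref_logits, beta=1.5):
    """Per-token reverse KL from the student to an extrapolated teacher.

    All logits have shape [batch, seq_len, vocab] and are evaluated on
    student-generated trajectories. beta=1 recovers standard OPD (reverse
    KL to the teacher); beta > 1 corresponds to the ExOPD setting, where
    the target up-weights tokens the teacher prefers over the reference.
    Illustrative sketch only, not the paper's released code.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # Extrapolated target: log q = log pi_ref + beta * (log pi_teacher - log pi_ref),
    # renormalized over the vocabulary.
    target_logp = ref_logp + beta * (teacher_logp - ref_logp)
    target_logp = target_logp - torch.logsumexp(target_logp, dim=-1, keepdim=True)

    # Reverse KL(student || target), averaged over batch and sequence positions.
    reverse_kl = (student_logp.exp() * (student_logp - target_logp.detach())).sum(-1)
    return reverse_kl.mean()
```

With beta=1.0 the target collapses to the teacher distribution and the reference model has no effect, matching the abstract's claim that standard OPD admits an arbitrary reference model; raising beta extrapolates the target beyond the teacher in the direction the teacher moved away from the reference.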