
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding

2026-04-15


Summary

This research digs into how on-policy distillation (OPD) works when a smaller language model is trained to learn from a larger one. It tries to understand *why* OPD sometimes works really well and other times fails.

What's the problem?

While OPD is used to improve smaller language models, it wasn't clear *what* specifically made it successful or unsuccessful. Researchers noticed that simply copying a bigger model's answers wasn't always enough, and they needed to figure out the key ingredients for effective learning. The core issue was a lack of understanding of the underlying mechanisms driving OPD's performance, leading to unpredictable results.

What's the solution?

The researchers found that OPD works best when the smaller 'student' model and the larger 'teacher' model think in similar ways, but the teacher also needs to actually *teach* the student something new – capabilities the student didn't already have. They also discovered that successful OPD focuses on getting the student to agree with the teacher on the most likely words to use in a given situation. If OPD isn't working, they suggest starting with some 'off-policy' data or carefully choosing prompts that align with the teacher's strengths. They also investigated what happens at the level of individual words (tokens) to understand how the learning process unfolds.
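To make the core idea concrete, here is a minimal sketch of the dense, token-level signal typically used in on-policy distillation: the student samples a response, and at each sampled position the student's next-token distribution is pulled toward the teacher's via a per-token reverse KL divergence. The toy logits and vocabulary size are illustrative assumptions, not from the paper, and a real implementation would backpropagate through the student's logits rather than just compute the loss.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits, teacher_logits):
    """Per-position KL(student || teacher) -- the dense token-level
    signal evaluated at states the *student* actually visited."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)

# Toy example: 3 student-sampled positions over a 5-token vocabulary.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(3, 5))
# A teacher whose distribution is close to the student's (compatible
# "thinking patterns") yields a small per-token loss.
teacher_logits = student_logits + 0.1 * rng.normal(size=(3, 5))
kl = reverse_kl(student_logits, teacher_logits)
print(kl)
```

The key on-policy ingredient is that the positions being scored come from the student's own samples, so the teacher's feedback lands exactly on the states the student visits during training.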

Why it matters?

Understanding OPD is important because it's a popular way to make large language models more efficient and accessible. By figuring out what makes it work, we can improve the process and make smaller models perform better, potentially reducing the computational cost of using these powerful AI tools. However, the research also raises questions about whether this technique will continue to be effective as we try to distill knowledge over longer and more complex tasks.

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, with a small shared token set concentrating most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
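The "small shared token set" observation in the abstract can be illustrated with a short sketch: for peaked next-token distributions, the intersection of the student's and teacher's top-k tokens tends to carry nearly all the probability mass, so agreeing on those few tokens is what matters. The toy distributions, vocabulary size, and the `topk_shared_mass` helper below are hypothetical, chosen only to illustrate the measurement, not taken from the paper.

```python
import numpy as np

def topk_shared_mass(p_student, p_teacher, k=3):
    """Probability mass each model places on the intersection of its
    top-k token set with the other model's (an illustrative metric)."""
    top_s = set(np.argsort(p_student)[-k:])
    top_t = set(np.argsort(p_teacher)[-k:])
    shared = sorted(top_s & top_t)
    return p_student[shared].sum(), p_teacher[shared].sum()

# Toy peaked next-token distributions over a 10-token vocabulary.
p_s = np.array([0.55, 0.30, 0.10, 0.02, 0.01,
                0.005, 0.005, 0.004, 0.003, 0.003])
p_t = np.array([0.50, 0.35, 0.09, 0.03, 0.01,
                0.005, 0.005, 0.004, 0.003, 0.003])
m_s, m_t = topk_shared_mass(p_s, p_t, k=3)
print(m_s, m_t)  # the shared top-3 tokens hold ~95% of each distribution
```

With only 3 of 10 tokens shared between the two top sets, each toy distribution already concentrates about 95% of its mass there, in the spirit of the 97%-99% figure the paper reports for successful OPD.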