Hybrid Policy Distillation for LLMs

Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu

2026-04-24

Summary

This paper focuses on a technique called knowledge distillation, which is a way to make very large language models smaller and more efficient without losing too much of their ability to perform tasks. The researchers analyzed how existing approaches to knowledge distillation relate to one another and proposed a new, improved method.

What's the problem?

Large language models are incredibly powerful, but they require a lot of computing power and memory. This makes them difficult to use on devices with limited resources, like phones or smaller computers. Existing methods for shrinking these models, known as knowledge distillation, often involve trade-offs: they might cover many possible outputs but lack precision, or be precise but miss much of what the teacher model can do. They can also be unstable during training and require a lot of data.

What's the solution?

The researchers realized that many knowledge distillation methods are actually just different ways of adjusting how the smaller 'student' model learns from the larger 'teacher' model. They proposed a new method called Hybrid Policy Distillation (HPD) that combines the strengths of two different learning approaches – one that focuses on covering all possible answers and another that focuses on finding the most likely answer. They also figured out a way to use existing data more efficiently and add a small amount of new data generated during training to help the student model learn better. Essentially, they created a more balanced and stable way to distill knowledge.
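The two learning approaches described above correspond to forward KL divergence (mode-covering) and reverse KL divergence (mode-seeking). Below is a minimal, illustrative sketch of how such a hybrid objective could be combined for a single next-token distribution; the function names, the `alpha` weighting scheme, and the toy distributions are assumptions for illustration, not the paper's exact formulation.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_kd_loss(teacher, student, alpha=0.5):
    """Hypothetical hybrid objective: interpolate forward KL
    (teacher || student, mode-covering) with reverse KL
    (student || teacher, mode-seeking)."""
    forward = kl(teacher, student)  # penalizes student for missing teacher modes
    reverse = kl(student, teacher)  # penalizes student mass where teacher has little
    return alpha * forward + (1 - alpha) * reverse

# Toy next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
student = [0.40, 0.40, 0.10, 0.10]
loss = hybrid_kd_loss(teacher, student, alpha=0.5)
```

With `alpha=1` the loss reduces to pure forward KL and with `alpha=0` to pure reverse KL, so the weight controls the balance between covering all of the teacher's plausible outputs and concentrating on its most likely ones.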

Why it matters?

This work is important because it provides a better way to compress large language models, making them more accessible and usable in a wider range of applications. The new method, HPD, is more stable, faster to train, and achieves better performance than existing methods across different types of tasks, like solving math problems, having conversations, and writing code. This means we can potentially have powerful AI tools available on more devices and for more people.

Abstract

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.