On Teacher Hacking in Language Model Distillation
Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel
2025-02-06

Summary
This paper studies a problem called 'teacher hacking' that can arise when a smaller AI language model is trained to imitate a larger one, a process known as knowledge distillation. The researchers found that this issue occurs when training relies on a fixed set of data, but can be avoided by generating new data during the training process.
What's the problem?
When smaller language models (called students) are trained to imitate bigger models (called teachers), there is a risk that the student exploits flaws in the teacher instead of properly learning the underlying language, much like reward hacking in reinforcement learning from human feedback. This can happen because the teacher model is itself an imperfect approximation of the true distribution, and the student may copy or even amplify its flaws.
What's the solution?
The researchers set up a controlled experiment with three models: an 'oracle' model representing the ground-truth distribution, a teacher model distilled from the oracle, and a student model distilled from the teacher. They discovered that teacher hacking happens when distillation relies on a fixed, offline dataset. In contrast, generating new data during the training process (online data generation) prevents teacher hacking, and the key factor is the diversity of the training data.
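To make the two training regimes concrete, below is a minimal toy sketch in PyTorch (not the paper's code): a student is trained to match a teacher's token distributions, either on a fixed offline prompt set or on freshly drawn prompts each step. The TinyLM architecture, vocabulary size, and the random stand-in for "generated" data are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): distill a "student" from a
# "teacher" by minimizing the forward KL on next-token distributions.
# TinyLM, VOCAB, CTX, and the random "generated" prompts are toy assumptions.
import torch
import torch.nn.functional as F

VOCAB, CTX = 32, 8  # toy vocabulary size and context length

class TinyLM(torch.nn.Module):
    """A toy next-token model: embedding -> mean-pool -> logits."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 16)
        self.out = torch.nn.Linear(16, VOCAB)

    def forward(self, x):                      # x: (batch, CTX) token ids
        return self.out(self.emb(x).mean(1))   # (batch, VOCAB) logits

teacher, student = TinyLM(), TinyLM()
opt = torch.optim.Adam(student.parameters(), lr=1e-2)

def distill_step(prompts):
    """One step: minimize KL(teacher || student) on the given prompts."""
    with torch.no_grad():
        p_teacher = F.softmax(teacher(prompts), dim=-1)
    log_q_student = F.log_softmax(student(prompts), dim=-1)
    loss = F.kl_div(log_q_student, p_teacher, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Offline distillation: the same fixed prompt set is reused throughout training.
offline_prompts = torch.randint(0, VOCAB, (256, CTX))
for step in range(100):
    batch = offline_prompts[torch.randint(0, 256, (32,))]
    distill_step(batch)

# Online distillation: fresh data is generated during training (in the paper,
# e.g. sampled from the student or teacher), which increases data diversity.
for step in range(100):
    fresh_batch = torch.randint(0, VOCAB, (32, CTX))  # stand-in for generated data
    distill_step(fresh_batch)
```

The distillation loss is the same in both loops; the only structural difference, and the one the paper identifies as decisive, is whether the training data is a fixed offline set or freshly generated, more diverse data.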
Why does it matter?
This research matters because it helps us understand how to train AI language models more effectively. By avoiding teacher hacking, we can create smaller, more efficient AI models that learn properly from larger ones without picking up bad habits or shortcuts. This could lead to better AI systems that are more reliable and perform tasks more accurately.
Abstract
Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. This phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.
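The abstract's setup can be summarized schematically as follows (the notation here is ours, not necessarily the paper's): distillation optimizes a proxy objective measured against the teacher, while the quantity that actually matters is measured against the oracle.

```latex
% Schematic only: D is a divergence (e.g., a KL divergence), x are prompts drawn
% from the data distribution, p_oracle is the ground-truth LM, p_teacher its
% distilled approximation, and q_theta is the student being trained.
\[
\mathcal{L}_{\text{proxy}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}_{\text{data}}}
    \Big[ D\big(p_{\text{teacher}}(\cdot \mid x) \,\|\, q_\theta(\cdot \mid x)\big) \Big],
\qquad
\mathcal{L}_{\text{golden}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}_{\text{data}}}
    \Big[ D\big(p_{\text{oracle}}(\cdot \mid x) \,\|\, q_\theta(\cdot \mid x)\big) \Big].
\]
```

In this schematic, teacher hacking corresponds to the regime where the proxy loss keeps improving while the golden loss plateaus or degrades; per the abstract, one practical signal is the optimization process deviating from polynomial convergence laws during offline distillation.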