Distillation Scaling Laws
Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
2025-02-13
Summary
This paper explains how to make AI models learn more efficiently through a process called distillation, where a smaller 'student' model learns from a larger 'teacher' model. The researchers found a way to predict how well this process will work based on how much computing power is allocated to the teacher and the student.
What's the problem?
Training large AI models takes a lot of time and computing power. Distillation is a way to create smaller, more efficient models, but it's been hard to know exactly how to split resources between training the teacher and student models to get the best results. This uncertainty has made it risky to use distillation for very large AI projects.
What's the solution?
The researchers created a 'distillation scaling law' that predicts how well a student model will perform based on how much computing power is spent on both the teacher and the student. They used it to work out the best way to allocate resources in different situations, such as when a teacher model already exists or when one needs to be trained from scratch. They also determined when distillation beats training a model the traditional, supervised way, and when it doesn't.
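The idea of a compute-allocation law can be sketched in code. The snippet below is a hypothetical illustration, not the paper's fitted law: the functional form, the Chinchilla-style coefficients, and the teacher-quality penalty are all made-up stand-ins, used only to show how such a law turns a teacher/student compute split into a predicted student loss.

```python
# Hypothetical illustration of a compute-allocation scaling law.
# Every coefficient and the functional form below are made up for
# demonstration; the paper fits its own distillation scaling law.

E, A, B = 1.7, 400.0, 410.0   # assumed irreducible loss and scale constants
ALPHA, BETA = 0.34, 0.28      # assumed parameter/data exponents
LAMBDA = 0.3                  # assumed penalty for an imperfect teacher

def pretrain_loss(params, tokens):
    """Chinchilla-style supervised pretraining loss (illustrative)."""
    return E + A / params**ALPHA + B / tokens**BETA

def student_loss(student_params, student_tokens, teacher_loss):
    """Toy distilled-student loss: the student's own capacity/data term
    plus a penalty proportional to the teacher's excess loss."""
    return pretrain_loss(student_params, student_tokens) + LAMBDA * (teacher_loss - E)

def best_split(total_flops, student_params, teacher_params):
    """Scan teacher/student compute splits for the lowest predicted
    student loss, using the 6 * params * tokens FLOPs approximation."""
    best_frac, best_loss = None, float("inf")
    for i in range(1, 20):
        frac = i / 20  # fraction of total compute spent training the teacher
        teacher_tokens = frac * total_flops / (6 * teacher_params)
        student_tokens = (1 - frac) * total_flops / (6 * student_params)
        t_loss = pretrain_loss(teacher_params, teacher_tokens)
        s_loss = student_loss(student_params, student_tokens, t_loss)
        if s_loss < best_loss:
            best_frac, best_loss = frac, s_loss
    return best_frac, best_loss

frac, loss = best_split(1e21, student_params=4e8, teacher_params=4e9)
print(f"best teacher compute fraction: {frac:.2f}, predicted student loss: {loss:.3f}")
```

Under these toy coefficients the scan finds an interior optimum, i.e. spending some but not all compute on the teacher; the paper's actual law also covers cases like reusing an existing teacher, where the teacher's training compute is already sunk.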
Why it matters?
This matters because it makes creating efficient AI models easier and less risky. By knowing how to allocate resources, researchers and companies can build better AI models with less wasted time and computing power. This could bring more capable AI to a wider range of devices, from smartphones to large-scale systems, and could speed up AI research and development overall.
Abstract
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.