Compact Language Models via Pruning and Knowledge Distillation
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
2024-07-23

Summary
This paper presents a method for making large language models (LLMs) smaller and more efficient by combining structured pruning with knowledge distillation. The goal is to reduce the data and compute needed to produce smaller model variants while maintaining performance.
What's the problem?
Training large language models from scratch is extremely resource-intensive, requiring enormous amounts of compute and time. This makes it costly to produce model variants at different sizes for different applications. Additionally, existing approaches to compressing these models often require retraining them extensively, which undercuts the efficiency gains.
What's the solution?
The authors propose pruning an existing LLM and then retraining it using only a small fraction (less than 3%) of the original training data. They develop a set of best practices for compressing models along structural axes (depth, width, attention heads, and MLP dimensions) and use knowledge distillation from the original model during retraining to retain its capabilities. With this approach, they derive 8B and 4B models from an already pretrained 15B model using up to 40x fewer training tokens per model than training from scratch, cutting the compute cost of training the full model family by about 1.8x. Their new models, called Minitron, perform well on various tasks and outperform several state-of-the-art compression techniques.
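To make the pruning step concrete, here is a minimal, hypothetical PyTorch sketch of activation-based width pruning for a single MLP block: hidden neurons are scored by their average activation magnitude on a small calibration set, and only the highest-scoring ones are kept. The class and function names, dimensions, and calibration data below are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): activation-based importance scoring
# for width pruning of one MLP block. All names and sizes are hypothetical.
import torch
import torch.nn as nn

class PrunableMLP(nn.Module):
    """Two-layer feed-forward block whose hidden neurons can be pruned."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))

@torch.no_grad()
def neuron_importance(mlp: PrunableMLP, calibration_batches) -> torch.Tensor:
    """Score each hidden neuron by the mean magnitude of its activation
    over a small calibration set (a common proxy for importance)."""
    scores = torch.zeros(mlp.up.out_features)
    for x in calibration_batches:              # x: (batch, seq, d_model)
        h = mlp.act(mlp.up(x))                 # (batch, seq, d_hidden)
        scores += h.abs().mean(dim=(0, 1))
    return scores

@torch.no_grad()
def prune_mlp(mlp: PrunableMLP, scores: torch.Tensor, keep: int) -> PrunableMLP:
    """Keep the `keep` highest-scoring hidden neurons and copy their weights
    into a smaller MLP."""
    idx = scores.topk(keep).indices.sort().values
    pruned = PrunableMLP(mlp.up.in_features, keep)
    pruned.up.weight.copy_(mlp.up.weight[idx])
    pruned.up.bias.copy_(mlp.up.bias[idx])
    pruned.down.weight.copy_(mlp.down.weight[:, idx])
    pruned.down.bias.copy_(mlp.down.bias)
    return pruned

# Example: halve the hidden width of a toy block using random calibration data.
mlp = PrunableMLP(d_model=512, d_hidden=2048)
calib = [torch.randn(2, 64, 512) for _ in range(4)]   # stand-in calibration set
small = prune_mlp(mlp, neuron_importance(mlp, calib), keep=1024)
```

The same scoring idea extends to attention heads and embedding channels; in the paper, the pruned model is then retrained with knowledge distillation from the original model rather than from scratch.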
Why it matters?
This research is important because it makes it easier and cheaper to develop smaller language models that can still perform well. This can lead to more accessible AI technology that can be used in devices with limited resources, such as smartphones or embedded systems, making advanced AI capabilities available to more people.
Abstract
Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.
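As a rough illustration of the distillation-based retraining mentioned in the abstract, the sketch below computes a logit-level distillation loss in which the pruned student is trained to match the teacher's next-token distribution via KL divergence. The function name, temperature handling, and the commented training-step usage are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch (not the released code): logit-based knowledge distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Forward KL between the teacher's and student's token distributions,
    averaged over all positions in the batch."""
    t = temperature
    vocab = student_logits.size(-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    student_logp = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    # KL(teacher || student), scaled by t^2 as is conventional for soft targets.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

# Hypothetical usage inside a training step (teacher frozen, student = pruned model):
# with torch.no_grad():
#     teacher_logits = teacher(batch["input_ids"]).logits
# student_logits = student(batch["input_ids"]).logits
# loss = distillation_loss(student_logits, teacher_logits)
# loss.backward()
```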