BitNet Distillation

Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei

2025-10-17

Summary

This paper introduces a new method called BitDistill, which makes large language models (LLMs) much smaller and faster without significantly sacrificing their performance on specific tasks.

What's the problem?

Large language models are incredibly powerful, but they require a lot of computing power and memory to run, making them difficult to use on devices with limited resources like phones or older computers. Simply shrinking these models often leads to a big drop in how well they perform on tasks they were designed for.

What's the solution?

BitDistill tackles this problem by converting the LLM's internal numbers (its weights), usually stored with high precision, into a simplified format using only three possible values: -1, 0, and 1. This is like going from a detailed color palette to just three basic colors. To make this work well without a large drop in accuracy, the researchers combined three key ideas: a special normalization module called SubLN (from BitNet) that stabilizes training; multi-head attention distillation (based on MiniLM), where the smaller model learns to mimic the attention patterns of the original full-precision model; and a continual pre-training 'warm-up' step that helps the model adapt to the ternary format before fine-tuning on the target task.
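To make the ternary idea concrete, here is a minimal sketch of absmean-style ternary quantization in the spirit of BitNet b1.58: each weight is scaled by the mean absolute value of the weight matrix, then rounded and clipped to {-1, 0, 1}. The function names and the epsilon constant are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Quantize a float weight array to ternary values {-1, 0, 1}.

    Uses absmean scaling (as in BitNet b1.58-style quantization):
    divide by the mean absolute value, then round and clip.
    Returns the ternary codes and the scale needed to dequantize.
    """
    scale = np.mean(np.abs(w)) + eps          # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)   # round to nearest of {-1, 0, 1}
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Recover an approximate float tensor from ternary codes."""
    return q.astype(np.float32) * scale

# Example: quantize a small random weight matrix
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = ternary_quantize(w)
w_hat = dequantize(q, scale)
```

Because each weight needs only about 1.58 bits (log2 of 3) instead of 16 or 32, storage shrinks dramatically, and matrix multiplications reduce to additions and subtractions.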

Why it matters?

BitDistill is important because it allows us to run powerful language models on less powerful hardware, saving memory and speeding up processing. The researchers showed that their method achieves performance similar to the original, full-sized models, while using up to ten times less memory and running up to 2.65 times faster on standard computer processors. This opens up possibilities for using LLMs in more places and for more people.

Abstract

In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between finetuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model size, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.
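The multi-head attention distillation mentioned above follows MiniLM, where the student is trained to match the teacher's attention distributions. A minimal sketch of such a loss, using KL divergence between teacher and student attention probabilities averaged over heads and query positions, might look like the following. The exact loss formulation, relation targets, and weighting used in BitDistill are not reproduced here; this only illustrates the general MiniLM-style objective.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_distill_loss(teacher_scores, student_scores, eps=1e-9):
    """KL(teacher || student) over attention distributions.

    Both inputs are raw attention scores of shape
    (heads, queries, keys). The loss is averaged over heads
    and query positions, MiniLM-style.
    """
    p = softmax(teacher_scores)   # teacher attention distribution
    q = softmax(student_scores)   # student attention distribution
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return kl.mean()

# Example: a student that matches the teacher incurs (near-)zero loss
rng = np.random.default_rng(1)
t = rng.normal(size=(2, 3, 3))
print(attention_distill_loss(t, t))   # ~0.0
```

Matching attention distributions, rather than only final outputs, gives the 1.58-bit student a much denser training signal about how the full-precision teacher routes information between tokens.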