
Matryoshka Quantization

Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati

2025-02-11


Summary

This paper introduces Matryoshka Quantization (MatQuant), a method that lets a single AI model run efficiently at multiple precision levels, removing the need to maintain separate quantized models and improving accuracy, especially for low-precision versions.

What's the problem?

When AI model weights are quantized to fewer bits (such as int4 or int2) to save memory and compute, accuracy often drops sharply; int2 in particular can severely degrade quality. To cope with this, developers typically have to train and maintain a separate model for each precision level, which is inefficient and costly.

What's the solution?

The researchers developed MatQuant, which exploits the nested (Matryoshka) structure of integer data types: lower-precision representations are contained within the most significant bits of higher-precision ones. As a result, only one model needs to be trained, and it can be served at various precision levels without losing much accuracy. They also use co-training and co-distillation across precisions, making the extracted low-precision models, such as int2, up to 10% more accurate than those produced by standard quantization methods.
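The nesting idea can be illustrated with a small sketch: given weights quantized to 8 bits, an int4 or int2 version can be read off by keeping only the most significant bits. The helper below (`slice_msb` is a hypothetical name, not from the paper, and this is only a toy illustration of the bit-level nesting, not the paper's training procedure):

```python
import numpy as np

def slice_msb(weights_int8, bits):
    """Extract a lower-precision (`bits`-wide) view of int8-quantized
    weights by keeping only their most significant bits.

    Hypothetical helper illustrating the nested (Matryoshka) structure
    of integer types; not MatQuant's actual implementation.
    """
    # Shift the signed range [-128, 127] to unsigned [0, 255] so that
    # bit slicing is well defined.
    u = weights_int8.astype(np.int16) + 128
    shift = 8 - bits
    sliced = u.astype(np.uint8) >> shift   # keep the top `bits` bits
    # Map back to the signed int8 range for comparison (scale by 2^shift).
    return (sliced.astype(np.int16) << shift) - 128

w = np.array([-128, -1, 0, 57, 127], dtype=np.int8)
print(slice_msb(w, 4))  # int4 view nested inside the int8 weights
print(slice_msb(w, 2))  # int2 view: coarser, but built from the same bits
```

The key point is that the int4 and int2 views share their bits with the int8 representation, so one stored model yields all three precisions; MatQuant's co-training makes those sliced views accurate rather than merely extractable.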

Why it matters?

This matters because it makes AI models more efficient and accessible by reducing the need for multiple versions while improving performance at lower precisions. This approach could save time, storage, and computing resources, making AI more practical for use on devices with limited power or memory.

Abstract

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.