Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu
2024-11-27
Summary
This paper examines how low-bit quantization affects the performance of large language models (LLMs) and finds that undertrained models suffer less quantization-induced degradation than fully trained ones.
What's the problem?
As LLMs become larger, they require substantial computational resources, which makes them slower and harder to deploy. Low-bit quantization addresses this by storing model weights at lower numeric precision, cutting memory and compute costs, but it can also degrade model quality. What has been unclear is how a model's size and training level determine how much performance it loses under low-bit quantization.
What's the solution?
The authors studied over 1,500 quantized LLM checkpoints of various sizes and training levels to understand how low-bit quantization impacts performance. They found that larger models, or models trained on fewer tokens, experience less degradation when quantized than smaller, extensively trained models. From these observations they derived scaling laws that predict how much a model of a given size, training budget, and bit width will degrade, and how many training tokens are needed to fully train models of various sizes.
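The summary above does not reproduce the paper's fitted functional form, but scaling laws of this kind are typically power laws. A minimal sketch, assuming a hypothetical form QiD = k · D^α / (N^β · b^γ), where D is training tokens, N is parameter count, and b is bit width (all coefficients below are illustrative placeholders, not the paper's fitted values):

```python
def predict_qid(tokens, params, bits, k=1.0, alpha=0.5, beta=0.3, gamma=2.0):
    """Hypothetical power-law scaling for quantization-induced degradation:
    QiD grows with training tokens and shrinks with model size and bit width.
    Coefficients are illustrative, not the paper's fits."""
    return k * tokens**alpha / (params**beta * bits**gamma)

# Qualitative trend from the paper: at the same bit width, a small,
# heavily trained model degrades more than a large, undertrained one.
small_well_trained = predict_qid(tokens=1e12, params=1e9, bits=2)
large_undertrained = predict_qid(tokens=1e10, params=7e10, bits=2)
assert small_well_trained > large_undertrained
```

Any functional form with this sign structure (degradation increasing in tokens, decreasing in size and bit width) reproduces the paper's qualitative finding; the exact exponents are what the 1,500+ checkpoints are used to fit.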
Why it matters?
This research is important because it provides insights into optimizing LLMs for better performance while using less computational power. By understanding how training levels affect quantization, researchers can develop more efficient AI models that maintain high quality in their outputs, making them more practical for real-world applications.
Abstract
We reveal that low-bit quantization favors undertrained large language models (LLMs) by observing that models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when applying low-bit quantization, whereas smaller models with extensive training tokens suffer significant QiD. To gain deeper insights into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and factors such as the number of training tokens, model size and bit width. With the derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM's training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model's training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co/Xu-Ouyang.
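The abstract's proposal to use QiD as a gauge of a model's training level amounts to inverting the scaling law to solve for the token count. Under the same hypothetical power-law form (QiD = k · D^α / (N^β · b^γ), with D training tokens, N parameters, b bit width; all coefficients illustrative, not the paper's fits), the inversion is a sketch like:

```python
# Illustrative coefficients for a hypothetical power law; not the paper's fits.
K, ALPHA, BETA, GAMMA = 1.0, 0.5, 0.3, 2.0

def qid(tokens, params, bits):
    # Degradation rises with training tokens, falls with size and bit width.
    return K * tokens**ALPHA / (params**BETA * bits**GAMMA)

def tokens_from_qid(measured_qid, params, bits):
    # Invert the law: estimate how many tokens a model was trained on,
    # given the degradation observed after quantizing it.
    return (measured_qid * params**BETA * bits**GAMMA / K) ** (1.0 / ALPHA)

# Round trip: the token count used to generate a QiD value is recovered.
observed = qid(1e12, 1e9, 4)
estimated_tokens = tokens_from_qid(observed, 1e9, 4)
assert abs(estimated_tokens - 1e12) / 1e12 < 1e-6
```

This is also the mechanism behind the 100-trillion-token projection: once the exponents are fitted, the law can be evaluated at token counts far beyond any existing checkpoint.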