EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

2024-07-17

Summary

This paper presents EfficientQAT, a quantization-aware training method that compresses large language models (LLMs) to low-bit weights, reducing their memory usage while maintaining high performance.

What's the problem?

Large language models are essential for many AI applications, but their memory and compute requirements make them difficult to deploy. Quantization-aware training (QAT) can cut memory use with only minimal accuracy loss, but existing QAT methods are expensive to run because they optimize all of the model's weights together with the quantization parameters, which takes a lot of training time and hardware.

What's the solution?

EfficientQAT introduces a two-phase process. First, block-wise training of all parameters (Block-AP) quantization-aware-trains each transformer block separately, using block-wise reconstruction so the quantized block matches the full-precision block's outputs; this preserves accuracy without ever training the entire model at once. Second, end-to-end training of quantization parameters (E2E-QP) starts from the quantized model and fine-tunes only the quantization step sizes on top of the frozen quantized backbone, rather than retraining everything. Together, the two phases keep training time and memory low while still reaching high accuracy. A rough sketch of the two phases is given below.
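The two phases can be sketched in PyTorch-style pseudocode. This is a minimal illustration, not the authors' implementation: the `fake_quantize`, `block_ap`, and `e2e_qp` helpers, the `"scale"` parameter-naming convention, the learning rates, and the Hugging Face-style `model(**batch).loss` interface are all assumptions made for readability.

```python
import torch
import torch.nn as nn

def fake_quantize(weight, scale, zero_point, n_bits=2):
    # Uniform "fake" quantization: snap weights to a low-bit integer grid,
    # then map back to floating point. scale is the learnable step size.
    qmax = 2 ** n_bits - 1
    q = torch.clamp(torch.round(weight / scale + zero_point), 0, qmax)
    return (q - zero_point) * scale

def block_ap(block, calib_inputs, fp_outputs, steps=100):
    # Phase 1 (Block-AP): train ALL parameters of a single transformer block
    # (assumed to contain quantizer-wrapped linear layers) so that its output
    # reconstructs the full-precision block's output on calibration data.
    opt = torch.optim.AdamW(block.parameters(), lr=1e-4)
    for _ in range(steps):
        loss = nn.functional.mse_loss(block(calib_inputs), fp_outputs)
        opt.zero_grad()
        loss.backward()
        opt.step()

def e2e_qp(model, dataloader, steps=1000):
    # Phase 2 (E2E-QP): freeze the quantized weights and train only the
    # quantization step sizes end to end on the task objective.
    for name, p in model.named_parameters():
        p.requires_grad = "scale" in name          # assumed naming convention
    step_sizes = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(step_sizes, lr=2e-5)
    for _, batch in zip(range(steps), dataloader):
        loss = model(**batch).loss                 # HF-style interface assumed
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Because Phase 2 updates only the per-layer step sizes, the number of trainable parameters is a tiny fraction of the full weight count, which is what keeps end-to-end training cheap.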

Why it matters?

This research matters because it lowers the resource requirements of large language models, making them practical to deploy in far more settings. By shrinking memory use without giving up much accuracy, EfficientQAT lets strong models run on cheaper hardware and serve latency-sensitive applications in natural language processing and broader AI. Improving how these models are compressed and deployed helps advance the technology across many areas.

Abstract

Large language models (LLMs) are integral to modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it demands substantial training resources to optimize model weights and quantization parameters. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a novel quantization technique for compressing LLMs. EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Block-AP sequentially conducts quantization-aware training for all parameters in each transformer block with block-wise reconstruction, maintaining efficiency by avoiding training the entire LLM. Initialized with the quantized model, E2E-QP then trains only quantization parameters (step sizes) end-to-end, enhancing efficiency with a fixed quantized backbone and a reduced trainable parameter count. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3% accuracy degradation compared to the full precision (69.48 vs. 72.41). Notably, this INT2 quantized 70B model obtains a 1.67 accuracy gain over the Llama-2-13B model (69.48 vs. 67.81) while requiring less memory (19.2GB vs. 24.2GB). Code is available at https://github.com/OpenGVLab/EfficientQAT.
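The "quantization parameters (step sizes)" trained in E2E-QP refer to the scale of a uniform quantizer. As a hedged illustration (the exact formulation in the paper may differ), a b-bit uniform quantizer with learnable step size s and zero-point z maps a weight w to

$$\hat{w} = s \cdot \left(\operatorname{clamp}\!\left(\left\lfloor \tfrac{w}{s} \right\rceil + z,\; 0,\; 2^{b}-1\right) - z\right),$$

and in the second phase only s (and optionally z) receives gradients, so the trainable parameter count is negligible compared with the frozen quantized weights.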