VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang

2024-09-30

Summary

This paper introduces a new method called Vector Post-Training Quantization (VPTQ) that compresses Large Language Models (LLMs) to extremely low bit-widths, such as 2 bits, while keeping them accurate and efficient.

What's the problem?

Large Language Models are very big and require a lot of memory and storage, making them hard to deploy in many applications. Traditional scalar quantization methods often break down at very low bit-widths, which leads to poor accuracy and high costs.

What's the solution?

The authors developed VPTQ, which uses Second-Order Optimization to decide how model weights should be compressed. With Vector Quantization, groups of weights are replaced by short indices into a small lookup table (a codebook), so the model takes up far less space. Their experiments show that VPTQ significantly reduces model size while achieving better accuracy than earlier low-bit methods.
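
To make the Vector Quantization idea concrete, here is a minimal sketch (not the paper's algorithm) of how weight sub-vectors can be mapped to indices in a learned codebook. The vector length, codebook size, and plain k-means clustering are illustrative assumptions; VPTQ instead guides the assignment and codebook design with Second-Order Optimization.

```python
import numpy as np

def build_codebook(weights, vector_len=8, num_centroids=256, iters=20):
    """Toy k-means codebook over weight sub-vectors (illustrative only).

    A 256-entry codebook over 8-element vectors stores one 8-bit index per
    vector, i.e. about 1 bit per weight plus the shared codebook.
    """
    vectors = weights.reshape(-1, vector_len)
    rng = np.random.default_rng(0)
    centroids = vectors[rng.choice(len(vectors), num_centroids, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every sub-vector to every centroid.
        d2 = ((vectors ** 2).sum(1, keepdims=True)
              - 2.0 * vectors @ centroids.T
              + (centroids ** 2).sum(1))
        assign = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned sub-vectors.
        for k in range(num_centroids):
            members = vectors[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, assign

def dequantize(centroids, assign, shape):
    """Rebuild an approximate weight matrix from indices plus the codebook."""
    return centroids[assign].reshape(shape)

# Toy usage on a random 512x512 weight matrix.
W = np.random.default_rng(1).standard_normal((512, 512)).astype(np.float32)
codebook, indices = build_codebook(W)
W_hat = dequantize(codebook, indices, W.shape)
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```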

Why it matters?

This research is important because it allows large language models to be deployed in environments with limited resources. By making these models smaller and faster with little loss of accuracy, VPTQ opens up new possibilities for using AI in everyday applications, such as on mobile devices or in real-time systems.

Abstract

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, 11-22% on LLaMA-3 on QA tasks on average. We only utilize 10.4-18.6% of the quantization algorithm execution time, resulting in a 1.6-1.8× increase in inference throughput compared to SOTA.
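
The residual quantization mentioned in the abstract can be illustrated with a small, hypothetical sketch: after a first codebook lookup, the leftover error is quantized with a second codebook, so two compact indices per vector approximate the original weights more closely than one. The codebook sizes and random data below are placeholders, not VPTQ's actual configuration.

```python
import numpy as np

def residual_vq(vectors, codebook1, codebook2):
    """Two-stage (residual) vector quantization sketch."""
    # Nearest-centroid assignment for the first stage.
    idx1 = ((vectors[:, None, :] - codebook1[None]) ** 2).sum(-1).argmin(1)
    residual = vectors - codebook1[idx1]
    # Quantize the leftover error with the second codebook.
    idx2 = ((residual[:, None, :] - codebook2[None]) ** 2).sum(-1).argmin(1)
    reconstruction = codebook1[idx1] + codebook2[idx2]
    return idx1, idx2, reconstruction

# Toy usage: 8-dim sub-vectors, two 16-entry codebooks (4 + 4 bits per vector).
rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 8)).astype(np.float32)
cb1 = rng.standard_normal((16, 8)).astype(np.float32)
cb2 = 0.1 * rng.standard_normal((16, 8)).astype(np.float32)
i1, i2, rec = residual_vq(vecs, cb1, cb2)
print("MSE with residual stage:", float(((vecs - rec) ** 2).mean()))
```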