
FlatQuant: Flatness Matters for LLM Quantization

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao

2024-10-18


Summary

This paper presents FlatQuant, a new method for compressing large language models (LLMs) so they run faster and use less memory, while losing almost none of their original accuracy.

What's the problem?

As large language models grow, they need to be compressed to run faster and use less memory. A common way to do this is quantization, which rounds the model's weights and activations to a small set of equally spaced values. The difficulty is outliers: a few values that are much larger than the rest stretch the quantization range, so most of the remaining values are represented too coarsely and the model loses accuracy.
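To see why outliers matter, here is a minimal, self-contained illustration (not from the paper): quantizing a well-behaved tensor versus the same tensor with a single large outlier, using equally spaced 4-bit levels. The helper function and numbers are purely illustrative.

```python
# Illustrative sketch (not the paper's code): how one outlier inflates
# quantization error when values are mapped to equally spaced 4-bit levels.
import numpy as np

def quantize_dequantize(x, n_bits=4):
    """Symmetric uniform quantization: scale by the max magnitude,
    round to equally spaced integer levels, then map back to floats."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = np.abs(x).max() / qmax        # step size set by the largest value
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
flat = rng.normal(0.0, 1.0, size=1024)    # well-behaved, "flat" distribution
steep = flat.copy()
steep[0] = 50.0                           # a single outlier stretches the range

for name, x in [("flat", flat), ("with outlier", steep)]:
    err = np.mean((x - quantize_dequantize(x)) ** 2)
    print(f"{name:13s} mean squared quantization error: {err:.4f}")
```

The outlier forces a much larger step size, so the quantization error of every other value grows, which is exactly the effect that flattening the distribution is meant to avoid.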

What's the solution?

To address this, the authors developed FlatQuant (Fast and Learnable Affine Transformation), which makes the weights and activations of LLMs 'flatter', meaning more evenly spread out, before they are quantized. It learns a separate affine transformation for each linear layer of the model, calibrated in a matter of hours using a lightweight objective. To keep the extra computation cheap, the transformation matrices are factored with a Kronecker decomposition and all of FlatQuant's operations are fused into a single kernel. In their experiments, 4-bit quantization of LLaMA-3-70B loses less than 1% accuracy, while inference runs up to 2.3x faster for prefill and 1.7x faster for decoding.
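The following is a rough PyTorch sketch of the core idea, under the assumption of a simple symmetric fake-quantizer and a single invertible matrix P per linear layer; the function names and the choice of P are illustrative, not the authors' implementation (which additionally uses Kronecker-factored matrices and a fused kernel).

```python
# Illustrative sketch of a per-layer affine transformation before quantization.
import torch

def fake_quant(t, n_bits=4):
    """Symmetric per-tensor fake quantization, used only for illustration."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max() / qmax
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax) * scale

def transformed_linear(x, W, P):
    """Quantize the transformed activations x P and transformed weights W P^{-T}.
    In full precision (x P)(P^{-1} W^T) = x W^T, so the output is unchanged;
    a well-chosen (learned) P makes both operands flatter and easier to quantize."""
    P_inv = torch.linalg.inv(P)
    x_t = fake_quant(x @ P)          # transformed + quantized activations
    W_t = fake_quant(W @ P_inv.T)    # transformed + quantized weights
    return x_t @ W_t.T

# Toy usage: a random layer with an identity transform. A learned P would be
# calibrated so that the layer's quantization error is minimized.
x = torch.randn(8, 16)
W = torch.randn(32, 16)
P = torch.eye(16)
print(transformed_linear(x, W, P).shape)  # torch.Size([8, 32])
```

Because the transformation cancels exactly in full precision, only the quantization error changes; the learning step simply searches for the P that makes the transformed weights and activations as flat, and therefore as quantization-friendly, as possible.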

Why it matters?

This research is important because it enhances how AI models can be deployed in real-world applications by making them faster and more efficient. With FlatQuant, developers can create LLMs that maintain high accuracy while being easier to run on devices with limited resources. This could lead to broader use of advanced AI technologies in various fields, such as healthcare, education, and mobile applications.

Abstract

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with the equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still remain steep and outspread. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach to enhance flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments show that FlatQuant sets up a new state-of-the-art quantization benchmark. For instance, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. For inference latency, FlatQuant reduces the slowdown induced by pre-quantization transformation from 0.26x of QuaRot to merely 0.07x, bringing up to 2.3x speedup for prefill and 1.7x speedup for decoding, respectively. Code is available at: https://github.com/ruikangliu/FlatQuant.
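As a side note on the Kronecker decomposition mentioned in the abstract, the snippet below (an illustrative sketch, not the paper's code) shows why factoring a large transformation matrix as a Kronecker product of two small matrices makes applying it much cheaper: the standard identity (A kron B) vec(X) = vec(B X A^T) avoids ever forming the full matrix.

```python
# Illustrative sketch: applying a Kronecker-structured transform efficiently.
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 8, 16                     # full dimension d = n1 * n2 = 128
A = rng.normal(size=(n1, n1))
B = rng.normal(size=(n2, n2))
x = rng.normal(size=n1 * n2)

# Direct multiplication by the full d x d Kronecker product: O(d^2) work.
full = np.kron(A, B) @ x

# Equivalent computation via (A kron B) vec(X) = vec(B X A^T), with
# column-major (Fortran-order) vec: only O(d * (n1 + n2)) work.
X = x.reshape(n2, n1, order="F")
fast = (B @ X @ A.T).reshape(-1, order="F")

print(np.allclose(full, fast))     # True
```

For a hidden dimension d = n1 * n2, this reduces the cost of applying the transformation from O(d^2) to O(d * (n1 + n2)), which is why the decomposition keeps the pre-quantization overhead small.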