Identifying Sensitive Weights via Post-quantization Integral
Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen
2025-03-07
Summary
This paper presents Post-quantization Integral (PQI), a new way to make large AI language models (LLMs) work well on computers with less memory and processing power.
What's the problem?
Big AI models need a lot of computing power to run, which is expensive. Current methods for making them smaller and faster aren't very accurate at figuring out which parts of the model are most important to keep at high quality.
What's the solution?
The researchers created PQI, a more accurate way to measure how important different parts of an AI model are. They also built a system called ReQuant that uses PQI to decide which parts of the model to simplify and which to keep detailed. This makes the model smaller and faster without losing much of its ability to understand and generate language.
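The "keep some parts detailed" idea is commonly realized as a Dense-and-Sparse split: a small set of large-magnitude outlier weights is stored in full precision, while the dense remainder is quantized. Below is a minimal NumPy sketch of that general idea; the function name, the 1% outlier fraction, and the uniform rounding quantizer are illustrative assumptions, not the paper's actual ReQuant procedure.

```python
import numpy as np

def dense_and_sparse_detach(w, outlier_frac=0.01, n_levels=16):
    """Toy Dense-and-Sparse split: keep the largest-magnitude weights
    ("outliers") in full precision as a sparse matrix, and quantize the
    dense remainder with uniform rounding. Illustrative only."""
    w = np.asarray(w, dtype=np.float64)
    k = max(1, int(outlier_frac * w.size))
    # Indices of the k largest-magnitude entries form the sparse part.
    flat_idx = np.argsort(np.abs(w), axis=None)[-k:]
    sparse = np.zeros_like(w)
    sparse.flat[flat_idx] = w.flat[flat_idx]
    # The dense remainder is rounded to a uniform grid of n_levels steps.
    dense = w - sparse
    scale = np.abs(dense).max() / (n_levels // 2 - 1) or 1.0
    dense_q = np.round(dense / scale) * scale
    return dense_q, sparse

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
dense_q, sparse = dense_and_sparse_detach(w)
reconstructed = dense_q + sparse
err_split = np.abs(reconstructed - w).max()  # bounded by half a grid step
```

Keeping even ~1% of the weights in full precision caps the worst-case reconstruction error at half a quantization step of the (smaller-range) dense part, which is why an accurate sensitivity metric for choosing those outliers pays off.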
Why it matters?
This matters because it could let powerful AI language models run on more devices, like phones or smaller computers. It could also reduce the cost of running these AIs, making them more accessible for different uses. The improvement shown on a specific model (Llama 3.2 1B) suggests this method could make a real difference in how we use AI in everyday applications.
Abstract
Serving Large Language Models (LLMs) is costly. However, post-training weight quantization can address this problem by both compressing their sizes for limited memory and saving bandwidth for acceleration. As not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric, and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local 2nd-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric that estimates posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced improvement of 2.66 perplexity gain on Llama 3.2 1B with QTIP.
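The abstract's central argument is that a local 2nd-order Taylor estimate underestimates the loss change caused by a large quantization step, whereas integrating the gradient along the straight path from the original to the quantized weight recovers the true change up to discretization error (by the fundamental theorem of calculus). A toy 1-D sketch of that contrast; the loss function, step size, and trapezoid integrator here are illustrative assumptions, not the paper's actual PQI formulation:

```python
import math

def loss(w):            # toy 1-D "loss", sharply curved away from w0
    return math.cosh(3.0 * w)

def grad(w):            # d loss / dw
    return 3.0 * math.sinh(3.0 * w)

def hess(w):            # d^2 loss / dw^2
    return 9.0 * math.cosh(3.0 * w)

w0, dw = 0.2, 0.8       # pretend quantization moved the weight by dw

true_delta = loss(w0 + dw) - loss(w0)

# Local 2nd-order Taylor estimate: the kind of gradient/Hessian metric
# the abstract argues is inaccurate for large steps.
taylor_delta = grad(w0) * dw + 0.5 * hess(w0) * dw**2

# Path-integral estimate in the spirit of PQI: integrate the gradient
# along the straight path from w0 to w0 + dw (trapezoid rule, 100 steps).
n = 100
integral_delta = sum(
    0.5 * (grad(w0 + (i / n) * dw) + grad(w0 + ((i + 1) / n) * dw)) * (dw / n)
    for i in range(n)
)
```

With these illustrative numbers, the Taylor estimate captures only about half of the true loss change, while the path integral matches it closely; this is the underestimation-by-local-approximation effect the abstract describes, shrunk to one dimension.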