SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen

2025-12-05

Summary

This paper introduces SignRoundV2, a new post-training quantization method for shrinking large language models (LLMs) so they can run more efficiently without losing much accuracy.

What's the problem?

Large language models are powerful, but they require a lot of computing power and memory. To make them usable on more devices, researchers try to 'quantize' them, meaning they reduce the number of bits used to represent the model's numbers. However, drastically reducing the bits (down to 2 or 4) often causes a significant drop in the model's performance, making it less accurate. Existing methods struggle to maintain accuracy at these extremely low bit levels.

What's the solution?

SignRoundV2 tackles this problem in two main ways. First, it quickly figures out which layers of the model are most sensitive to quantization errors, so more bits can be given to those layers when mixed precision is used. It does this by looking at how changes in the model's weights affect the output (the gradients), combined with how much the quantization process actually changes those weights. Second, before any fine-tuning, it runs a lightweight search over the scaling factors used during quantization to further improve accuracy. Notably, the method remains effective even without mixed precision, that is, without assigning different numbers of bits to different layers, which keeps deployment simple.
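The sensitivity idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: function names, the round-to-nearest quantizer, and the greedy bit allocation are all assumptions made for clarity, while the score itself follows the stated recipe of combining gradient information with the quantization-induced weight deviation.

```python
# Illustrative sketch (NOT the paper's code): score each layer by how much
# quantization error interacts with its gradients, then greedily give the
# higher bit-width to the most sensitive layers under an average-bit budget.
import numpy as np

def quantize_rtn(w, bits):
    """Symmetric round-to-nearest quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

def layer_sensitivity(weights, grads, bits):
    """Score ~ sum of |gradient * (quantized weight - original weight)|."""
    deviation = quantize_rtn(weights, bits) - weights
    return float(np.sum(np.abs(grads * deviation)))

def allocate_bits(layers, budget_bits, choices=(2, 4)):
    """Greedy allocation: most sensitive layers get the higher bit-width
    until the average-bit budget is spent. `layers` maps a layer name to
    a (weights, gradients) pair."""
    lo, hi = min(choices), max(choices)
    scores = {name: layer_sensitivity(w, g, lo)
              for name, (w, g) in layers.items()}
    alloc = {name: lo for name in layers}
    budget = (budget_bits - lo) * len(layers)  # extra bits available
    for name in sorted(scores, key=scores.get, reverse=True):
        if budget >= hi - lo:
            alloc[name] = hi
            budget -= hi - lo
    return alloc
```

With an average budget of 3 bits over four layers, for example, the two most sensitive layers would end up at 4 bits and the other two at 2 bits.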

Why does it matter?

This research is important because it allows for much more efficient deployment of LLMs. By achieving good performance even with very low-bit quantization, models can run on devices with limited resources, like phones or laptops, and it reduces the cost of running these models in data centers. The results show that SignRoundV2 can get close to the accuracy of the original, full-precision models, even at 2 bits, which is a significant step forward.

Abstract

Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2-bits and even 4-bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
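The "lightweight pre-tuning search for quantization scales" mentioned in the abstract can be illustrated with a simple grid search. This is a minimal sketch under assumed details: the candidate multipliers, the clipped round-to-nearest quantizer, and the mean-squared-error objective are illustrative choices, not taken from the paper.

```python
# Illustrative sketch (NOT the paper's code): before fine-tuning, try a few
# multipliers on the default max-abs scale and keep the one that minimizes
# the reconstruction error of the quantized weights.
import numpy as np

def quantize_with_scale(w, scale, bits=2):
    """Symmetric quantization with a given scale, clipped to the bit range."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def search_scale(w, bits=2, grid=np.linspace(0.5, 1.0, 11)):
    """Grid-search scale multipliers; return the best scale and its MSE."""
    base = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    best_scale, best_err = base, np.inf
    for m in grid:
        err = np.mean((quantize_with_scale(w, m * base, bits) - w) ** 2)
        if err < best_err:
            best_scale, best_err = m * base, err
    return best_scale, best_err
```

At 2 bits, shrinking the scale below the max-abs default often reduces overall error, because clipping a few outliers is cheaper than coarsening the grid for every other weight.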