
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

2025-10-06

Summary

This paper investigates new, very low-precision number formats – specifically 4-bit floating point numbers like MXFP4 and NVFP4 – that are designed to speed up the process of using large language models (LLMs). It finds that while these formats *can* be faster, they don't automatically work well and require special techniques to achieve good results.

What's the problem?

The idea behind MXFP4 and NVFP4 is to represent numbers with fewer bits, making calculations faster, especially on modern GPUs. However, the study found that existing methods for simplifying models (called quantization) don't work well with these formats. NVFP4 groups numbers into very small blocks that each share one scale factor, and the authors show this provably neutralizes the usual tricks for taming unusually large "outlier" values. MXFP4, meanwhile, restricts each block's scale factor to a power of two, which introduces too much rounding error and degrades accuracy. Essentially, the promise of speed isn't realized because the models become less accurate.
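A toy sketch (ours, not the paper's method) can make the power-of-two scaling problem concrete. Both formats store each value on the FP4 (E2M1) grid and attach a shared scale to each small block; the sketch below quantizes one hand-picked block of 8 values (real MXFP4 blocks hold 32 values and NVFP4 blocks 16) once with an MXFP4-style power-of-two scale and once with an unrestricted scale, and compares the resulting error:

```python
import numpy as np

# Magnitudes representable by an FP4 (E2M1) value, shared by MXFP4 and NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, scale):
    """Divide a block by its scale, snap to the nearest FP4 value, rescale."""
    s = np.abs(x) / scale
    idx = np.abs(FP4_GRID[None, :] - s[:, None]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

# One small block of weights (8 values here for brevity).
x = np.array([0.1, -0.2, 0.7, 1.3, -2.9, 4.5, -0.05, 3.3])

# Ideal scale maps the block's largest magnitude onto the top of the FP4 grid.
ideal_scale = np.abs(x).max() / FP4_GRID[-1]       # 4.5 / 6 = 0.75

# MXFP4-style: the scale itself must be a power of two; rounding it up
# avoids clipping but coarsens the effective grid for the whole block.
pow2_scale = 2.0 ** np.ceil(np.log2(ideal_scale))  # -> 1.0

mx_err = np.abs(x - quantize_block(x, pow2_scale)).mean()
nv_err = np.abs(x - quantize_block(x, ideal_scale)).mean()
print(f"power-of-two scale error: {mx_err:.5f}")   # larger
print(f"fine-grained scale error: {nv_err:.5f}")   # smaller
```

On this block the power-of-two restriction roughly doubles the mean error, which is the "high induced error" the paper attributes to MXFP4's scale quantization.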

What's the solution?

To fix this, the researchers developed a new quantization technique called Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ algorithm tailored to these 4-bit floating-point formats. It applies a Hadamard transform to each small block of numbers, a rotation that spreads extreme values out so they quantize more gracefully, and adds format-specific optimizations on top. The researchers also wrote high-performance GPU kernels that run this method with negligible overhead: the rotation is fused into the model's weights ahead of time and computed quickly on the fly for the activations.
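A small sketch (again ours, not the paper's GPU kernels) shows why a block-wise Hadamard rotation helps. Applied to a block dominated by one outlier, the rotation spreads that outlier's energy across every coordinate, shrinking the largest magnitude the 4-bit grid has to cover; and because the rotation is orthonormal, its transpose undoes it exactly, which is what allows fusing it into the weights without changing the layer's output:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# A block dominated by one outlier weight.
x = np.array([0.1, -0.2, 0.05, 8.0, 0.1, -0.1, 0.2, 0.0])

H = hadamard(8)
rotated = H @ x  # the outlier's energy is spread over all 8 coordinates

# The largest magnitude drops sharply, so the FP4 grid covers it better.
print(np.abs(x).max(), "->", round(np.abs(rotated).max(), 3))

# Orthonormality: the transpose inverts the rotation exactly.
restored = H.T @ rotated
```

The block size of 8 is illustrative; the key point is that the rotation operates within each small scaling block, matching the microscaling format's structure.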

Why it matters?

This work shows that simply switching to a lower-precision format isn't enough to get a performance boost; the quantization techniques themselves must be adapted to the specific characteristics of the new format. MR-GPTQ demonstrates that with format-specific optimizations, these 4-bit formats can deliver significant speedups, up to 3.6x per layer and 2.2x end-to-end on an NVIDIA B200, and up to 6x per layer and 4x end-to-end on an RTX 5090, while matching or even improving accuracy. Notably, it boosts MXFP4 accuracy to near that of NVFP4, opening up possibilities for faster and more efficient LLM inference.

Abstract

The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.