INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo
2025-11-03
Summary
This paper compares two ways of representing numbers in AI systems, 'floating-point' (FP) and 'integer' (INT) formats, in the context of large language models like the ones powering chatbots. It examines how this choice affects both how accurately the models work and how efficiently they run on hardware.
What's the problem?
AI models are getting bigger and need faster hardware. A common trick to speed things up is to use less precise numbers, like rounding things off. While the tech industry is leaning towards low-precision floating-point numbers, no one had really done a head-to-head comparison to see whether integers might be better, especially across different quantization granularities, that is, how many numbers share a single scaling factor. This lack of evidence makes it hard to co-design AI algorithms and the hardware they run on.
What's the solution?
The researchers systematically tested both floating-point and integer formats at different granularities, from coarse (one scale shared by a huge chunk of numbers) to fine (one scale per small block). They found that floating-point works better for large groups of numbers, but integers can be superior when values are broken into small blocks. In particular, they showed that a popular 8-bit integer format (MXINT8) beats its floating-point counterpart in both accuracy and hardware efficiency. They also improved integer training by introducing a symmetric clipping method that fixes a bias in how gradients are quantized, making INT training nearly lossless. Finally, they demonstrated that a 4-bit integer format (NVINT4) can even outperform its floating-point equivalent (NVFP4) when outlier-mitigation techniques such as Hadamard rotation are applied.
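To make the block-wise idea concrete, here is a minimal sketch of an MX-style INT8 quantizer: blocks of 32 values share one power-of-two scale, and each element is rounded onto a symmetric INT8 grid. This is an illustration under stated assumptions, not the paper's actual kernel; the function name and the exact rounding/clipping choices are mine.

```python
import numpy as np

def mxint8_quantize(x, block_size=32):
    """Illustrative MX-style INT8 block quantizer (a sketch, not the paper's kernel).

    Each block of `block_size` consecutive values shares one power-of-two
    scale; elements are rounded onto a symmetric INT8 grid. Clipping to
    [-127, 127] rather than the asymmetric native range [-128, 127] keeps
    the grid symmetric around zero, which is the flavor of fix the paper's
    symmetric clipping applies to avoid biased gradients.
    """
    x = np.asarray(x, dtype=np.float64)
    assert x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax = np.maximum(absmax, np.finfo(np.float64).tiny)  # guard all-zero blocks
    # Smallest power-of-two scale such that max|block| / scale <= 127.
    scale = 2.0 ** np.ceil(np.log2(absmax / 127.0))
    q = np.clip(np.rint(blocks / scale), -127, 127)
    return (q * scale).reshape(x.shape)  # dequantized approximation of x

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
xq = mxint8_quantize(x)
max_err = float(np.max(np.abs(x - xq)))
```

Because the scale is recomputed per 32-element block, a single large value only degrades the resolution of its own block, which is exactly why fine granularity narrows the gap between INT and FP formats.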
Why it matters?
This research challenges the idea that floating-point is always the best way to go for AI. It suggests that using fine-grained integer formats, particularly MXINT8, can provide a better balance of accuracy, speed, and power efficiency for future AI hardware. This could lead to the development of AI chips that are faster and use less energy, which is crucial as AI models continue to grow in size and complexity.
Abstract
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
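The abstract's claim that Hadamard rotation helps low-bit INT on outlier-heavy data can be demonstrated on toy data. The sketch below uses a generic symmetric INT4 quantizer with one floating-point scale per 16-element block (the block size and scale format are my assumptions, loosely modeled on the NV 4-bit formats, not the paper's implementation) and a standard Sylvester Hadamard matrix as the rotation.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int4_block_quantize(x, block_size=16):
    """Generic symmetric INT4 block quantizer (illustrative, per-block FP scale)."""
    blocks = x.reshape(-1, block_size)
    scale = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-30) / 7.0
    q = np.clip(np.rint(blocks / scale), -7, 7)  # 4-bit symmetric grid
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
x[::64] = 30.0                       # a few large activation-style outliers
H = hadamard(256)

plain_err = float(np.mean((x - int4_block_quantize(x)) ** 2))
# Rotate, quantize in the rotated basis, rotate back; H is orthogonal,
# so the rotation itself is lossless and only redistributes outlier energy.
rot_err = float(np.mean((x - H.T @ int4_block_quantize(H @ x)) ** 2))
```

Without rotation, each outlier inflates the shared scale of its whole block and the other 15 values lose nearly all their resolution; after rotation the outlier energy is spread across coordinates, so every block sees a similar dynamic range and the 4-bit grid is used efficiently.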