
WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

Jiale Chen, Vage Egiazarian, Torsten Hoefler, Dan Alistarh

2025-12-03


Summary

This paper focuses on making large language models smaller and faster through quantization, a technique that reduces the precision of the numbers used to represent the model's weights and activations. It introduces a new method to improve this process, leading to better performance.

What's the problem?

When you reduce the precision of the numbers in a language model (quantize them), a few values with extreme magnitudes can mess things up. These outliers stretch the range of numbers the quantizer must cover, which wastes resolution on all the other values and reduces quantization's effectiveness. A common fix is to apply mathematical transformations before quantization, but existing transformations ignore the specific statistics of the model's data, and it hasn't been known whether they are the best possible approach.
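The effect is easy to see with a toy AbsMax round-to-nearest quantizer (a minimal Python sketch for illustration, not the paper's code; the values and bit width are made up):

```python
def absmax_quantize(values, bits=4):
    """Round-to-nearest AbsMax quantization to a signed integer grid.

    The scale is set by the largest magnitude in the block, so a single
    outlier stretches the grid and coarsens resolution for the rest.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 levels per side for INT4
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def max_error(values):
    recon = absmax_quantize(values)
    return max(abs(v - r) for v, r in zip(values, recon))

well_behaved = [0.9, -0.4, 0.7, -0.8, 0.2, -0.6, 0.5, -0.1]
with_outlier = well_behaved[:-1] + [16.0]   # one extreme activation

print(max_error(well_behaved))   # small: the grid matches the data range
print(max_error(with_outlier))   # large: every non-outlier rounds to zero
```

With the outlier present, the scale grows by roughly 18x, and all seven well-behaved values collapse onto the zero grid point: the quantizer spends nearly all of its resolution on a single number.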

What's the solution?

The researchers developed a new way to transform the data *before* quantization that is tailored to the data itself. They derived the mathematically optimal transformation, considering both the weights and activations of the model. This transformation, called WUSH, builds on a standard one (the Hadamard transform) but adds a component that adapts to the data's statistics, such as its average and spread (its second-order moments). This makes it provably better than applying a fixed transformation alone.
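To see why even the fixed Hadamard baseline helps, here is a small pure-Python sketch (illustrative values, not the paper's implementation): rotating the block spreads the outlier's energy across all coordinates, so the AbsMax grid is no longer dominated by a single entry.

```python
import math

def absmax_quantize(values, bits=4):
    # Round-to-nearest AbsMax quantization (scale from the largest magnitude).
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def hadamard(n):
    # Sylvester construction; n must be a power of two. The result is symmetric.
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-x for x in r] for r in H]
    return H

def rotate(v):
    # Orthonormal Hadamard transform H / sqrt(n); it is its own inverse.
    n = len(v)
    return [sum(h * x for h, x in zip(row, v)) / math.sqrt(n)
            for row in hadamard(n)]

def sse(a, b):
    # Sum of squared reconstruction errors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

x = [0.9, -0.4, 0.7, -0.8, 0.2, -0.6, 0.5, 16.0]      # block with one outlier

direct = absmax_quantize(x)                           # quantize as-is
rotated = rotate(absmax_quantize(rotate(x)))          # rotate, quantize, undo

print(sse(x, direct), sse(x, rotated))                # rotated error is smaller
```

Because the rotation is orthogonal, it preserves total error energy, so shrinking the dynamic range in the rotated space directly shrinks the reconstruction error. WUSH goes beyond this fixed rotation by making the transform data-aware.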

Why it matters?

This work is important because it allows for more effective quantization of large language models. By improving quantization, we can make these models smaller, faster, and more energy-efficient, making them easier to deploy on devices with limited resources, like phones or laptops, without significantly sacrificing performance.

Abstract

Quantization to low bitwidth is a standard approach for deploying large language models; however, a few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. A common mitigation approach is to apply some fixed orthogonal transforms, such as Hadamard matrices, before quantization, which typically reduces the dynamic range. Yet, these transforms ignore the statistics of the data, and their optimality is currently not understood. In this work, we derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization using standard data-free quantizers for common numerical formats. Specifically, we provide derivations of the optimal adaptive (data-aware) transforms for round-to-nearest (RTN), AbsMax-scaled block quantizers for both integer and floating-point formats. The resulting construction, which we call WUSH, combines a Hadamard backbone with a data-dependent component based on second-order moments, yielding a non-orthogonal transform that is provably optimal under mild assumptions and remains structured for efficient implementation. Preliminary experimental results show that our approach consistently improves upon the Hadamard transform for common formats.