SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
2024-11-08

Summary
This paper introduces SVDQuant, a method that improves the efficiency of diffusion models by quantizing both their weights and activations to 4 bits, reducing memory usage and speeding up inference.
What's the problem?
Diffusion models are great for generating high-quality images, but as they get larger, they require more memory and take longer to run. This makes them difficult to use in real-time applications, which is a problem for developers who want to deploy these models on devices with limited resources.
What's the solution?
SVDQuant addresses this issue by quantizing the model's weights and activations down to 4 bits. Instead of just smoothing out errors (which traditional methods do), SVDQuant absorbs outliers (unusual values that cause large quantization errors) using a low-rank component. It first shifts the problematic outliers from the activations into the weights, then uses Singular Value Decomposition (SVD) to split off a small, high-precision low-rank branch that captures those weight outliers, leaving a residual that is much easier to quantize. Additionally, the authors built an inference engine called Nunchaku that fuses the low-rank branch's computation into the 4-bit kernels, cutting redundant memory access so the quantization speedup is preserved.
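To make the idea concrete, here is a minimal numpy sketch of the decomposition described above. It is illustrative only: the function name, the per-channel smoothing scale, and the naive symmetric 4-bit quantizer are assumptions for the example, not the paper's actual implementation (which uses calibrated smoothing and optimized GPU kernels).

```python
import numpy as np

def svdquant_sketch(W, X, rank=8, bits=4):
    """Hypothetical sketch of SVDQuant's decomposition (not the paper's API).

    Approximates X @ W with a high-precision low-rank branch plus a
    4-bit-quantized residual branch.
    """
    # 1) Smoothing: shift activation outliers into the weights.
    lam = np.maximum(np.abs(X).max(axis=0), 1e-8)  # per-channel activation scale
    X_hat = X / lam                                # smoothed activations
    W_hat = W * lam[:, None]                       # weights absorb the outliers
    # 2) SVD: peel off a high-precision low-rank branch L1 @ L2
    #    that captures the dominant (outlier-heavy) directions.
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]
    L2 = Vt[:rank]
    R = W_hat - L1 @ L2                            # residual: easier to quantize
    # 3) Naive symmetric 4-bit quantization of the residual (dequantized here
    #    to simulate the error; a real engine keeps R in 4-bit integers).
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4 bits
    scale = np.abs(R).max() / qmax
    R_q = np.round(R / scale).clip(-qmax, qmax) * scale
    # Output: high-precision low-rank branch + low-bit residual branch.
    return X_hat @ L1 @ L2 + X_hat @ R_q
```

Because the low-rank branch soaks up the largest singular values, the residual has a much smaller dynamic range, so quantizing it to 4 bits loses far less information than quantizing the full weight matrix directly.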
Why it matters?
This research is significant because it allows larger diffusion models to run more efficiently without losing quality in the images they generate. By reducing memory usage by 3.5× and delivering a 3.0× speedup over a 4-bit weight-only quantized baseline, SVDQuant makes it easier to use these advanced models in practical applications, such as interactive content creation on personal computers.
Abstract
Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving 3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.