Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
2025-08-21
Summary
This research explores how to make a newer type of large language model, called diffusion large language models (dLLMs), smaller and more efficient so they can run on everyday devices. It is the first study to systematically examine compression techniques for these models, focusing on how to handle abnormally large internal values, known as activation outliers, that make compression difficult.
What's the problem?
Newer, powerful language models that generate text through a 'diffusion' process are too big and resource-hungry to run on devices like smartphones or laptops. Post-training quantization, a common way to shrink conventional autoregressive language models, hasn't been thoroughly tested on these diffusion models, and a specific issue with 'activation outliers' (very large internal values) makes shrinking them especially tricky.
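To see why such outliers matter, consider the simplest form of quantization, where the scale is set by the largest absolute value in a tensor: a single huge activation stretches the quantization grid so far that ordinary values are rounded to almost nothing. The sketch below is a generic illustration of this effect, not code from the paper; the tensor size, the 4-bit setting, and the outlier magnitude are arbitrary choices for the demo.

```python
# Illustrative sketch (not from the paper): how one activation outlier
# degrades naive absmax quantization for all the ordinary values.
import numpy as np

def absmax_quantize(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor quantization: the scale is set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8, 7 for int4
    scale = np.abs(x).max() / qmax        # a single outlier dominates this scale
    return np.clip(np.round(x / scale), -qmax, qmax) * scale   # dequantized values

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 0.1, size=4096)    # "typical" activations
acts_outlier = acts.copy()
acts_outlier[0] = 50.0                    # one abnormally large activation value

for name, x in [("no outlier", acts), ("with outlier", acts_outlier)]:
    xq = absmax_quantize(x, bits=4)
    mse = np.mean((x[1:] - xq[1:]) ** 2)  # error on the ordinary values only
    print(f"{name:12s} MSE on typical values: {mse:.6f}")
```

Running this, the error on the typical values grows dramatically once the outlier sets the scale, which is exactly the difficulty the paper points to for low-bit settings.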
What's the solution?
The paper investigates how to shrink diffusion language models by evaluating existing post-training quantization methods along four dimensions: how many bits of information are used, which quantization method is applied, which tasks the models perform, and which type of model is being compressed. The authors identify 'activation outliers' as a major hurdle for low-bit quantization and use this multi-perspective evaluation to offer practical guidance on how to work around it, with a concrete sketch of two of these dimensions shown below.
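As a purely illustrative example of the bit-width and quantization-method dimensions, the sketch below compares per-tensor and per-channel weight quantization at several bit-widths. The matrix shapes, the round-to-nearest scheme, and the error metric are assumptions for the demo, not the paper's experimental setup.

```python
# Illustrative sketch (not the paper's code): the same weight matrix quantized
# with two common PTQ granularities at several bit-widths.
import numpy as np

def quantize_weights(w: np.ndarray, bits: int, per_channel: bool) -> np.ndarray:
    """Round-to-nearest symmetric quantization with per-tensor or per-channel scales."""
    qmax = 2 ** (bits - 1) - 1
    if per_channel:
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # one scale per output channel
    else:
        scale = np.abs(w).max() / qmax                        # one scale for the whole tensor
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
# Rows with very different magnitudes, loosely mimicking a real projection layer.
w = rng.normal(size=(8, 512)) * rng.uniform(0.01, 1.0, size=(8, 1))

for bits in (8, 4, 3):
    for per_channel in (False, True):
        mse = np.mean((w - quantize_weights(w, bits, per_channel)) ** 2)
        print(f"W{bits} {'per-channel' if per_channel else 'per-tensor '} MSE: {mse:.2e}")
```

Per-channel scales usually track the weight distribution more closely than a single per-tensor scale, and the gap widens as the bit-width drops, which is the kind of configuration-dependent behavior the study maps out.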
Why it matters?
This work is important because it lays the groundwork for making advanced language models, which are currently only accessible on powerful computers, usable on the devices we use every day. By figuring out how to shrink these dLLMs efficiently, the research could lead to more accessible AI tools and applications for everyone.
Abstract
Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All code and experimental setups will be released to support the community.
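For readers unfamiliar with how activation outliers are usually tamed, one standard idea from the PTQ literature on autoregressive LLMs (methods in the spirit of SmoothQuant) is to rescale each activation channel and fold the inverse factor into the next layer's weights, so the layer output is unchanged while the activation range the quantizer must cover shrinks. The sketch below illustrates the identity X @ W == (X / s) @ (s[:, None] * W); it is a generic example with made-up shapes, not the method proposed or evaluated by this paper.

```python
# Generic sketch of activation "smoothing" for outliers, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.1, size=(16, 64))   # activations; one outlier channel below
X[:, 3] *= 100.0
W = rng.normal(0.0, 0.02, size=(64, 32))  # weights of the following linear layer

# Per-channel smoothing factor; here simply the activation absmax per input channel.
s = np.abs(X).max(axis=0)
X_smooth = X / s                          # activations now share a uniform range
W_smooth = W * s[:, None]                 # fold the same factor into the weights

# The layer output is mathematically unchanged, but the activation range
# the quantizer has to cover is far smaller.
assert np.allclose(X @ W, X_smooth @ W_smooth)
print("max |activation| before:", np.abs(X).max(), " after:", np.abs(X_smooth).max())
```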