Fast-dLLM v2: Efficient Block-Diffusion LLM
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
2025-10-08
Summary
This paper introduces Fast-dLLM v2, a new way to make large language models generate text much faster without losing quality.
What's the problem?
Large language models are really good at tasks like writing and translating, but they generate text one word at a time, which is slow. This sequential process limits how quickly they can respond, making them inefficient for real-time applications.
What's the solution?
The researchers took existing, well-trained language models and adapted them using a technique called 'block diffusion.' Instead of generating text word-by-word, Fast-dLLM v2 generates text in blocks, allowing for parallel processing and a significant speed boost. They also developed clever caching methods to remember previous parts of the text, further speeding things up. Importantly, they achieved this with a relatively small amount of additional training data compared to other similar approaches.
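The core idea of generating a block of tokens in parallel can be pictured with a toy sketch. Everything below is illustrative, not the paper's implementation: `toy_model` stands in for a single forward pass of the LLM, and the confidence threshold mimics how parallel decoders commit several high-confidence tokens per pass instead of exactly one.

```python
import random

MASK = "<mask>"
BLOCK_SIZE = 4

def toy_model(context, block):
    """Stand-in for the LLM: returns (position, token, confidence) for
    every masked slot in the current block. A real model would run one
    forward pass over context + block here."""
    preds = []
    for i, tok in enumerate(block):
        if tok == MASK:
            preds.append((i, f"t{len(context) + i}", random.random()))
    return preds

def decode_block(context, threshold=0.5):
    """Fill one block by unmasking, on each pass, every prediction whose
    confidence clears the threshold (parallel decoding), instead of
    committing exactly one token per forward pass (AR decoding)."""
    block, passes = [MASK] * BLOCK_SIZE, 0
    while MASK in block:
        passes += 1
        preds = toy_model(context, block)
        accepted = [(i, t) for i, t, c in preds if c >= threshold]
        if not accepted:  # always commit at least the most confident token
            accepted = [max(preds, key=lambda p: p[2])[:2]]
        for i, t in accepted:
            block[i] = t
    return block, passes
```

With random confidences and a 0.5 threshold, roughly half the remaining masked slots commit on each pass, so a block typically finishes in fewer forward passes than its length; that gap is the source of the speedup over one-token-per-pass autoregressive decoding.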
Why it matters?
This work is important because it makes powerful language models much more practical to use. By dramatically increasing the speed of text generation while maintaining accuracy, Fast-dLLM v2 brings us closer to being able to deploy these models in applications where quick responses are crucial, like chatbots or real-time content creation.
Abstract
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs, marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
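The abstract's two-level cache can be sketched as a small data structure: completed blocks are frozen into a block-level cache, while positions already unmasked inside the block currently being decoded live in a sub-block cache until the block finishes. The class and method names here are illustrative assumptions, not the paper's API, and plain strings stand in for the key/value representations a real attention cache would hold.

```python
class HierarchicalCache:
    """Toy sketch of a block-level + sub-block cache (names illustrative)."""

    def __init__(self):
        self.block_cache = []  # frozen representations of fully decoded blocks
        self.sub_cache = {}    # position -> representation for tokens already
                               # unmasked inside the partially decoded block

    def context(self):
        """Full context the model attends to: past blocks, then whatever has
        been unmasked so far in the current block, in position order."""
        return list(self.block_cache) + [
            rep for _, rep in sorted(self.sub_cache.items())
        ]

    def update_sub(self, pos, rep):
        """Cache a newly unmasked position so later passes over this block
        need not recompute it."""
        self.sub_cache[pos] = rep

    def commit_block(self):
        """Block fully decoded: promote its representations to the
        block-level cache and reset the sub-block cache."""
        self.block_cache += [rep for _, rep in sorted(self.sub_cache.items())]
        self.sub_cache.clear()
```

The point of the split is that block-level entries are written once and never revisited, while sub-block entries are cheap to update as parallel decoding fills in a block over several passes.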