
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie

2025-05-30


Summary

This paper introduces Fast-dLLM, a way to make diffusion-based large language models generate text much faster, without retraining them and with little loss in answer quality.

What's the problem?

The problem is that diffusion language models, which can in principle generate many words at once instead of one at a time, are often slower than regular models in practice: they cannot cache and reuse the computations from earlier steps the way autoregressive models do with a KV cache, and when they do generate several words at once, those words are predicted independently, so they can end up not fitting together.

What's the solution?

The researchers created two new strategies. The first is a block-wise approximate KV Cache, a memory system that lets the model reuse attention computations from earlier blocks instead of redoing them at every step. The second is confidence-aware parallel decoding, which generates several words at once but only commits the ones the model is very confident about, leaving the uncertain ones for later steps. Together, these make the model much faster at answering questions or writing text while still keeping its answers accurate.
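To make the confidence-aware decoding idea concrete, here is a minimal Python sketch of threshold-based parallel decoding. It is not the authors' implementation: the toy_model stand-in, the MASK sentinel, the vocabulary size, and the 0.9 threshold are all illustrative assumptions, and the block-wise KV cache (which would additionally let the model skip recomputing attention for already-finished blocks) is omitted.

import numpy as np

MASK = -1          # sentinel id for a still-masked position (assumption)
VOCAB_SIZE = 32    # toy vocabulary size (assumption)
THRESHOLD = 0.9    # confidence threshold for committing a token (assumption)

rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in for a diffusion LLM forward pass: returns per-position
    probabilities over the vocabulary. A real model would condition on the
    prompt and the partially unmasked block; here we just sample logits."""
    logits = rng.normal(size=(len(tokens), VOCAB_SIZE)) * 3.0
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def decode_block(block_len, max_steps=16):
    """Confidence-aware parallel decoding of one block of masked tokens."""
    tokens = np.full(block_len, MASK, dtype=int)
    for _ in range(max_steps):
        masked = tokens == MASK
        if not masked.any():
            break
        probs = toy_model(tokens)
        best_ids = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)
        # Commit only masked positions whose confidence clears the threshold;
        # if none do, commit the single most confident one so progress is made.
        commit = masked & (confidence >= THRESHOLD)
        if not commit.any():
            most_confident = np.where(masked)[0][confidence[masked].argmax()]
            commit[most_confident] = True
        tokens[commit] = best_ids[commit]
    return tokens

print(decode_block(8))

The key design point this sketch illustrates is that the number of tokens committed per step adapts to the model's confidence: easy positions are filled in parallel, while uncertain ones wait for more context from later steps.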

Why does it matter?

This is important because it means AI models can generate text much more quickly and efficiently, making them more practical for real-world use in things like chatbots, writing assistants, and coding tools, all without needing extra training or sacrificing quality.

Abstract

A novel block-wise approximate KV Cache and confidence-aware parallel decoding strategy improve the inference speed of diffusion-based large language models without significant quality loss.