
DFlash: Block Diffusion for Flash Speculative Decoding

Jian Chen, Yesheng Liang, Zhijian Liu

2026-02-06


Summary

This paper introduces DFlash, a new way to speed up large language models (LLMs) by making them generate text faster. It focuses on improving a technique called 'speculative decoding', in which a small, fast model predicts what the LLM will say next and the LLM then quickly checks whether those predictions are correct.
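To make the draft-and-check idea concrete, here is a minimal, self-contained Python sketch of one speculative decoding step. The `draft_next` and `target_next` functions are hypothetical stand-ins for the small draft model and the large target LLM; real systems work with probability distributions and verify the whole draft in a single parallel pass rather than the greedy token matching shown here.

```python
# Toy sketch of speculative decoding (not the paper's code).
# draft_next / target_next are placeholders for a cheap draft model
# and an expensive target LLM operating on token-id lists.

def draft_next(context):
    return (context[-1] + 1) % 50          # hypothetical cheap draft model

def target_next(context):
    return (context[-1] + 1) % 50 if context[-1] % 7 else 0  # hypothetical target LLM

def speculate_and_verify(context, num_draft=4):
    """Draft several tokens cheaply, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(context)
    for _ in range(num_draft):              # classic drafting is still token-by-token
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in draft:                        # in a real system this check is one parallel forward pass
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    if len(accepted) < len(draft):           # on a mismatch, fall back to the target's own token
        accepted.append(target_next(ctx))
    return accepted

print(speculate_and_verify([3, 4, 5]))
```

The more often the draft agrees with the target, the more tokens get accepted per expensive target call, which is where the speedup comes from.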

What's the problem?

Large language models are really good at tasks like writing and answering questions, but they take a long time to generate text because they produce it one word at a time. This sequential process slows things down and doesn't fully use the power of hardware like GPUs. Existing speed-up methods such as speculative decoding still draft their predictions one word at a time, so the sequential bottleneck remains, and alternative approaches like 'diffusion' models, which can generate words in parallel, haven't been as accurate as standard models.
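For contrast, this is roughly what the baseline bottleneck looks like: every new word requires one full model call, and each call has to wait for the previous one. The `next_token` function below is a dummy placeholder for one expensive forward pass of the LLM.

```python
# Minimal sketch of why standard decoding is slow: the calls cannot overlap.

def next_token(tokens):
    return sum(tokens) % 100        # dummy stand-in for an expensive forward pass

def generate(prompt_tokens, n_new=8):
    tokens = list(prompt_tokens)
    for _ in range(n_new):           # strictly sequential: step t waits for step t-1
        tokens.append(next_token(tokens))
    return tokens

print(generate([1, 2, 3]))
```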

What's the solution?

The researchers developed DFlash, which uses a lightweight model called a 'block diffusion model' to quickly draft potential text, and the main, more powerful LLM then checks that draft. Importantly, DFlash generates an entire block of draft words in a single forward pass instead of one by one, and it conditions the draft on context features extracted from the main LLM to make the drafts more accurate. This leads to fewer drafts being rejected and a faster overall process.
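Below is an illustrative sketch (not the authors' released code) of how the pieces fit together as described above: the draft model reads features from the target model, proposes an entire block of tokens in one call, and the target then keeps the longest prefix it agrees with. All function names and the toy 'models' here are hypothetical placeholders.

```python
# Illustrative sketch of the DFlash idea: one-shot block drafting plus parallel verification.

def target_features(context):
    return sum(context)              # stand-in for hidden states the target LLM exposes to the drafter

def draft_block(context, features, block_size=4):
    # One forward pass yields the whole block (no token-by-token drafting).
    return [(features + i) % 50 for i in range(1, block_size + 1)]

def target_next(context):
    return (sum(context) + 1) % 50   # stand-in for the target LLM's next token

def dflash_step(context, block_size=4):
    feats = target_features(context)
    block = draft_block(context, feats, block_size)

    accepted, ctx = [], list(context)
    for tok in block:                # in practice this check is one parallel pass on the GPU
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    if len(accepted) < block_size:   # on a mismatch, fall back to the target's own token
        accepted.append(target_next(ctx))
    return accepted

print(dflash_step([3, 4, 5]))
```

Because drafting takes one call instead of one call per token, and because conditioning on the target's features raises the acceptance rate, each expensive target pass yields more confirmed tokens.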

Why it matters?

DFlash significantly speeds up LLMs (over six times faster in their tests) without losing any accuracy. It is also up to 2.5 times faster than EAGLE-3, the best existing speculative decoding method, which makes these powerful AI models more practical for real-world applications where speed is crucial, like chatbots or content creation.

Abstract

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.