Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, Shiwei Liu
2025-08-28
Summary
This paper focuses on making diffusion language models, a newer type of AI for generating text, faster. These models can generate tokens in parallel and in flexible orders, but their inference is currently slower than that of the more common 'autoregressive' models, which produce text one token at a time.
What's the problem?
Diffusion language models take many refinement steps to produce an answer, and each step requires re-attending to the entire sequence, which is computationally expensive. This makes generating high-quality text slow. The researchers noticed that, surprisingly, these models often 'know' the correct answer much earlier in the process, sometimes by the halfway point of the refinement steps, yet keep refining it unnecessarily.
What's the solution?
The researchers developed a method called 'Prophet' that requires no extra training. At each refinement step, Prophet checks the gap between the model's top two candidate predictions. If that confidence gap is large enough, it stops the refinement process early and 'commits' to the most likely answer, decoding all remaining tokens at once. It's like saying, 'Okay, we're pretty sure about this, let's finish it!'
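The core test can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`confidence_gap`, `should_commit`) and the fixed threshold are assumptions for the sake of the example.

```python
import numpy as np

def confidence_gap(logits):
    """Per-position gap between the top-2 softmax probabilities."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    top2 = np.sort(probs, axis=-1)[..., -2:]          # [second-best, best]
    return top2[..., 1] - top2[..., 0]

def should_commit(logits, masked, threshold=0.9):
    """Commit early only if every still-masked position is decisive.

    `logits` has shape (seq_len, vocab); `masked` is a boolean array
    marking positions not yet decoded.  The threshold value here is
    illustrative, not taken from the paper.
    """
    return bool(np.all(confidence_gap(logits)[masked] >= threshold))
```

For example, a position with logits `[10, 0, 0]` has a near-1.0 gap and would pass, while near-uniform logits like `[0.1, 0, 0]` would block the early commit.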
Why it matters?
Prophet significantly speeds up diffusion language models, cutting the number of decoding steps by up to 3.4x in their tests, without sacrificing the quality of the generated text. This matters because it makes these promising models more practical for real-world applications, and it shows that deciding *when* to stop refining, rather than just *how* to refine, is a valuable lever for improving speed.
Abstract
Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified by half of the refinement steps, well before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
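The "continue refining vs. go all-in" decision described in the abstract can be sketched as a toy decoding loop. Everything here is an assumption for illustration: `prophet_decode`, the `MASK` placeholder, the fixed threshold, and the one-token-per-step refinement policy are stand-ins; the paper's actual criterion and threshold schedule may differ.

```python
import numpy as np

MASK = -1  # illustrative placeholder id for not-yet-decoded positions

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prophet_decode(predict_logits, seq_len, max_steps, threshold=0.9):
    """Toy refinement loop with early commit.

    `predict_logits(tokens)` is any callable returning (seq_len, vocab)
    logits for the current partial sequence.  Each step decodes the most
    confident masked position; if every masked position's top-2 gap
    clears `threshold`, all remaining tokens are filled in one shot.
    """
    tokens = np.full(seq_len, MASK)
    for _ in range(max_steps):
        masked = tokens == MASK
        if not masked.any():
            break
        probs = softmax(predict_logits(tokens))
        top2 = np.sort(probs, axis=-1)[:, -2:]
        gaps = top2[:, 1] - top2[:, 0]
        if np.all(gaps[masked] >= threshold):
            # "all-in": commit every remaining token at once
            tokens[masked] = probs.argmax(-1)[masked]
            break
        # otherwise refine: decode only the most confident masked position
        idx = np.where(masked)[0]
        pick = idx[gaps[idx].argmax()]
        tokens[pick] = probs[pick].argmax()
    return tokens
```

With a model that is decisive everywhere, the loop commits on the first step instead of running all `max_steps`, which is the source of the step-count savings the paper reports.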