DMax: Aggressive Parallel Decoding for dLLMs
Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
2026-04-10
Summary
This paper introduces DMax, a method that makes diffusion language models, a type of AI that generates text, work much faster without losing quality.
What's the problem?
When these language models generate text quickly by working on different parts of the text at the same time (parallel decoding), errors can build up and ruin the final result. Existing methods try to fix this, but they often slow down the process or don't fully solve the error problem. They also typically decode by switching abruptly from a masked state to a final token, which leaves no room for refinement.
What's the solution?
DMax tackles this by changing *how* the model decodes. Instead of a sudden switch, it treats decoding as a gradual improvement, starting from a 'mask' representing uncertainty and slowly refining it into actual words. They also developed a new training method called 'On-Policy Uniform Training' which teaches the model to recover correct words even if it makes mistakes during the process. Finally, they use 'Soft Parallel Decoding' where each step is a blend between the predicted word and the initial uncertainty, allowing the model to constantly revise its work in a smoother way.
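The core idea of Soft Parallel Decoding, as described above, is that each intermediate state is a blend of the predicted token embedding and the mask embedding. A minimal sketch of that blending step is below; the function name and the use of a per-position confidence score as the interpolation weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def soft_decode_step(token_emb, mask_emb, confidence):
    """Blend each position's predicted token embedding with the mask
    embedding. Low-confidence positions stay close to the mask
    embedding, so the model can still revise them on later steps.

    token_emb:  (seq_len, dim) predicted token embeddings
    mask_emb:   (dim,) or (seq_len, dim) mask embedding
    confidence: (seq_len, 1) weights in [0, 1]
    """
    return confidence * token_emb + (1.0 - confidence) * mask_emb
```

At confidence 1 a position commits fully to its predicted token; at confidence 0 it remains pure "uncertainty" (the mask embedding), which is what allows the gradual mask-to-token refinement instead of a hard binary switch.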
Why it matters?
This research is important because it significantly speeds up text generation with these powerful AI models. The paper shows the model solving math problems and coding challenges much faster while maintaining the same level of accuracy. This makes the models more practical for real-world applications, and they achieved a very high speed of over 1,300 tokens (pieces of text) generated per second on powerful hardware.
Abstract
We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
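The On-Policy Uniform Training idea in the abstract, training the model to recover clean tokens from both masked inputs and its own erroneous predictions, can be sketched as a corruption routine that mixes the two noise sources. This is a toy illustration: the function names, the `MASK` sentinel, and the `mask_ratio` parameter are assumptions for exposition, not the paper's actual training recipe.

```python
import random

MASK = "<mask>"

def corrupt(clean_tokens, model_predict, mask_ratio=0.5):
    """Build a training input mixing masked positions with the model's
    own (possibly wrong) on-policy predictions. The training target
    remains clean_tokens, so the model learns to recover correct tokens
    from both masked-dLLM-style and uniform-dLLM-style corruption."""
    corrupted = []
    for i, _tok in enumerate(clean_tokens):
        if random.random() < mask_ratio:
            corrupted.append(MASK)               # masked corruption
        else:
            corrupted.append(model_predict(i))   # on-policy corruption
    return corrupted
```

Setting `mask_ratio=1.0` recovers ordinary masked-dLLM training, while `mask_ratio=0.0` trains purely on the model's own outputs; the unified objective lets one model handle both regimes.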