Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
2025-12-11
Summary
This paper introduces SchED, a method that speeds up decoding in diffusion large language models (dLLMs) without significantly sacrificing output quality.
What's the problem?
Diffusion large language models are powerful but slow: they refine the entire response over many iterative denoising steps, so producing an answer takes a long time. This makes them less practical for real-world applications where quick responses are needed. Existing methods to speed them up often reduce the quality of the generated text, especially for longer responses.
What's the solution?
SchED works by monitoring how confident the model is in its predictions at each denoising step. Instead of running the full schedule of refinement steps, it stops early once the model's confidence clears a threshold that adapts to how far generation has progressed. Confidence is measured from logit margins, the gaps between the model's scores for competing token choices, aggregated over the whole response. The method requires no additional training and can be applied to existing models. The researchers tested it on two families of dLLMs and a range of tasks, including question answering, math problem solving, and translation.
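The idea described above can be sketched in a few lines. This is a minimal illustration only: the linear threshold schedule, the threshold values, and the function names below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def top2_margin(logits):
    """Mean gap between the top-2 logits across all response positions.

    logits: array of shape (length, vocab_size).
    A larger mean gap means the model is more decisively committed
    to its current token choices.
    """
    sorted_logits = np.sort(logits, axis=-1)  # ascending along vocab axis
    return float(np.mean(sorted_logits[:, -1] - sorted_logits[:, -2]))

def should_stop(margin, step, total_steps, m_min=2.0, m_max=6.0):
    """Progress-dependent early-exit rule (hypothetical linear schedule).

    Early in decoding the required margin is high (m_max); it relaxes
    smoothly toward m_min as generation progresses, so only genuinely
    stable predictions trigger an early exit at the start.
    """
    progress = step / total_steps
    threshold = m_max - (m_max - m_min) * progress
    return margin >= threshold
```

In an actual decoder, `top2_margin` would be computed over the response span's logits at every denoising step, and decoding would halt at the first step where `should_stop` returns True, skipping the remaining refinement steps.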
Why it matters?
SchED is important because it makes dLLMs much faster, up to four times faster in some cases, while retaining almost all of the original quality. This makes these powerful models more usable for everyday tasks and opens up applications where speed is critical. It also outperforms previous attempts to speed up these models, particularly when generating longer pieces of text.
Abstract
Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluate SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants, across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves 3.8-4.0× speedups while retaining 99.8-100% of the baseline score on average. On base models, SchED yields consistent speedup gains with 99.1-100% performance retention, with up to 2.34× under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, γ = 4), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
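The abstract's QPS metric combines speedup with a heavy penalty on quality loss. Its exact definition is not given in this summary; the sketch below shows one plausible form with the stated property (an exponent γ that sharply discounts any drop in quality retention), purely as an illustration.

```python
def qps(speedup, retention, gamma=4):
    """Illustrative quality-penalized speed metric (not the paper's
    exact definition): raw speedup discounted by the fraction of
    baseline quality retained, raised to the power gamma.

    speedup:   wall-clock acceleration factor (e.g. 4.0 for 4x)
    retention: fraction of baseline score kept (1.0 = no quality loss)
    gamma:     penalty exponent; gamma=4 punishes quality loss heavily
    """
    return speedup * (retention ** gamma)
```

Under this form, a 4× speedup with perfect retention scores 4.0, while the same speedup at 90% retention scores only about 2.62, so methods that trade quality for speed are sharply penalized.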