KLASS: KL-Guided Fast Inference in Masked Diffusion Models
Seo Hyun Kim, Sunwoo Hong, Hojung Jung, Youngrok Park, Se-Young Yun
2025-11-12
Summary
This paper introduces a new way to speed up the process of generating text, images, or even molecules using a type of artificial intelligence called masked diffusion models.
What's the problem?
Masked diffusion models are really good at creating things, but they do it slowly because they build up the final result step-by-step, refining it over and over. This iterative process makes generating content take a long time, which is a major drawback.
What's the solution?
The researchers developed a technique called KL-Adaptive Stability Sampling, or KLASS. Essentially, KLASS figures out which parts of the generated content the model is already very confident about. Instead of refining everything slowly, it reveals those confident parts early, allowing for much faster generation without sacrificing quality. It does this by computing the KL divergence between each token's predicted distributions at consecutive refinement steps: tokens whose predictions have stopped changing are considered stable, and KLASS unmasks several of them at once. Because this only changes the sampling procedure, no retraining of the model is needed.
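The core idea can be sketched in a few lines. The snippet below is an illustrative, simplified version (function names, the threshold value, and the logits-in-lists representation are assumptions for clarity, not the paper's actual implementation): for each still-masked position, compute the KL divergence between the token distributions predicted at two consecutive denoising steps, and treat positions with near-zero KL as stable enough to unmask together.

```python
import math

def softmax(logits):
    """Convert raw logits for one token position into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(logits_prev, logits_curr):
    """KL(p_curr || p_prev) for a single token position: how much the
    model's prediction for this token changed between two steps."""
    p = softmax(logits_curr)
    q = softmax(logits_prev)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def select_stable_positions(prev_logits, curr_logits, masked, threshold=1e-3):
    """Return the still-masked positions whose per-token KL between two
    consecutive refinement steps is below `threshold` (a hypothetical
    cutoff) -- these 'stable' predictions can all be unmasked at once."""
    return [
        i
        for i, is_masked in enumerate(masked)
        if is_masked and token_kl(prev_logits[i], curr_logits[i]) < threshold
    ]

# Toy example: position 0's prediction is unchanged between steps (KL = 0),
# while position 1's prediction flipped, so only position 0 is unmasked.
prev = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
curr = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
print(select_stable_positions(prev, curr, masked=[True, True]))  # [0]
```

Unmasking every stable position in one iteration, rather than one token per step, is what turns a per-token stability signal into a wall-clock speedup.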
Why it matters?
This is important because it makes these powerful AI models much more practical to use. By significantly speeding up the generation process – up to 2.78 times faster in some cases – KLASS allows for quicker results in areas like reasoning, text creation, image generation, and even designing new molecules. It’s a broadly useful improvement that works across different types of models and tasks.
Abstract
Masked diffusion models have demonstrated competitive results on various tasks including language generation. However, due to their iterative refinement process, inference is often bottlenecked by slow and static sampling. To overcome this problem, we introduce KL-Adaptive Stability Sampling (KLASS), a fast yet effective sampling method that exploits token-level KL divergence to identify stable, high-confidence predictions. By unmasking multiple tokens in each iteration without any additional model training, our approach speeds up generation significantly while maintaining sample quality. On reasoning benchmarks, KLASS achieves up to 2.78× wall-clock speedups while improving performance over standard greedy decoding, attaining state-of-the-art results among diffusion-based samplers. We further validate KLASS across diverse domains, including text, image, and molecular generation, showing its effectiveness as a broadly applicable sampler across different models.