Learning Unmasking Policies for Diffusion Language Models

Metod Jazbec, Theo X. Olausson, Louis Béthune, Pierre Ablin, Michael Kirchhof, Joao Monteiro, Victor Turrisi, Jason Ramapuram, Marco Cuturi

2025-12-11

Summary

This paper explores how to make diffusion language models, which are a newer type of AI for generating text, work better and faster. These models are becoming competitive with more traditional text generators, and this research focuses on improving how they create text step-by-step.

What's the problem?

Diffusion language models work by starting with a buffer of 'mask' tokens and gradually replacing them with real words. A key challenge is deciding *which* mask tokens to replace at each step. Simply picking tokens randomly isn't very effective. Existing methods use rules of thumb, like unmasking the tokens the model is most confident about, but these rules need to be manually tuned, and their performance degrades as the buffer of masked tokens grows larger. Essentially, it's hard to automatically figure out the best way to 'unmask' the text during generation.
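To make the heuristic concrete, here is a minimal sketch of confidence-threshold unmasking. This is an illustrative reconstruction, not the paper's code: the threshold value and the fallback rule are assumptions.

```python
import numpy as np

# Hypothetical sketch of a confidence-thresholding heuristic. At each step,
# the model assigns every masked position a confidence score (e.g. the
# probability of its top predicted token). The rule unmasks all positions
# above a manually tuned threshold, falling back to the single most
# confident position so the loop always makes progress.

def select_unmask_positions(confidences, mask, threshold=0.9):
    """Return indices of masked positions to unmask this step."""
    masked_idx = np.flatnonzero(mask)          # positions still masked
    conf = confidences[masked_idx]
    chosen = masked_idx[conf > threshold]      # everything above threshold
    if chosen.size == 0:                       # none confident enough:
        chosen = masked_idx[[np.argmax(conf)]] # take the best single one
    return chosen

confidences = np.array([0.95, 0.40, 0.99, 0.60, 0.30])
mask = np.array([True, True, True, False, True])  # position 3 already filled
print(select_unmask_positions(confidences, mask))  # → [0 2]
```

Note how the threshold is a fixed knob: the summary's point is that this knob must be tuned by hand, which is what the learned policy replaces.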

What's the solution?

The researchers used reinforcement learning – a technique where an AI learns by trial and error – to train a system that decides which tokens to unmask. They treated the language model itself as the 'environment' and created a small 'policy' (a simple, single-layer transformer) that learns to look at the model's confidence in different words and then choose which masked tokens to replace. This policy learns to make smart unmasking decisions without needing someone to manually set rules.
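The shape of such a policy can be sketched as follows. This is an assumed architecture for illustration only: the hidden size, head count, and the choice of feeding scalar confidences as the sole input feature are guesses, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the idea: a lightweight single-layer transformer
# that reads per-position dLLM confidences and emits, for each still-masked
# position, a probability of unmasking it this step.

class UnmaskPolicy(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)    # lift scalar confidence
        self.encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True)
        self.head = nn.Linear(d_model, 1)     # per-token unmask logit

    def forward(self, confidences, mask):
        # confidences: (B, L) top-token probabilities from the dLLM
        # mask: (B, L) bool, True where the token is still masked
        h = self.embed(confidences.unsqueeze(-1))
        h = self.encoder(h)
        logits = self.head(h).squeeze(-1)
        # already-filled positions can never be unmasked again
        return logits.masked_fill(~mask, float("-inf"))

policy = UnmaskPolicy()
conf = torch.rand(1, 8)
mask = torch.tensor([[True] * 6 + [False] * 2])
probs = torch.sigmoid(policy(conf, mask))     # unmask probability per token
actions = torch.bernoulli(probs)              # sampled unmasking decisions
```

Because the policy only consumes confidence scores rather than the full hidden states, it stays cheap to run and, as the summary notes, has a chance of transferring across different underlying dLLMs.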

Why it matters?

This research is important because it offers a way to automatically optimize the text generation process in diffusion language models. The trained 'unmasking' policies perform as well as, or even better than, existing methods, and they can adapt to different models and longer texts. This could lead to faster and higher-quality text generation, making these models more practical for real-world applications, though it does have some limitations when dealing with very different types of text than it was trained on.

Abstract

Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. Efficiency can be gained by unmasking several tokens in parallel, but doing too many at once risks degrading the generation quality. Thus, one critical design aspect of dLLMs is the sampling procedure that selects, at each step of the diffusion process, which tokens to replace. Indeed, recent work has found that heuristic strategies such as confidence thresholding lead to both higher quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger buffer sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy architecture based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive generation, while outperforming them in the full diffusion setting. We also examine the transferability of these policies, finding that they can generalize to new underlying dLLMs and longer sequence lengths. However, we also observe that their performance degrades when applied to out-of-domain data, and that fine-grained tuning of the accuracy-efficiency trade-off can be challenging with our approach.
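The abstract's MDP framing can be illustrated with a toy environment step. This is a schematic sketch under stated assumptions: the `MASK` sentinel, the `dllm_sample` callback, and the reward-free transition are simplifications for illustration, not the paper's formalization.

```python
# Hypothetical sketch of the MDP view: the dLLM is the environment, a state
# is the partially unmasked buffer, an action is the set of positions to
# unmask, and the transition fills those positions with dLLM-sampled tokens.
MASK = -1  # sentinel for a still-masked position

def env_step(buffer, action_positions, dllm_sample):
    """One MDP transition: unmask the chosen positions."""
    next_buffer = list(buffer)
    for pos in action_positions:
        # the dLLM (environment) samples a token for this position
        next_buffer[pos] = dllm_sample(buffer, pos)
    done = MASK not in next_buffer  # episode ends when fully unmasked
    return next_buffer, done

# toy stand-in for the dLLM that "samples" a fixed token
buf, done = env_step([MASK, MASK, 7], [0], lambda b, p: 42)
print(buf, done)  # → [42, -1, 7] False
```

Unmasking several positions per action is exactly the efficiency lever the abstract describes: larger actions mean fewer environment steps, but risk lower generation quality.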