
Introspective Diffusion Language Models

Yifan Yu, Yuqing Jian, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song, Tri Dao, Ben Athiwaratkun, James Zou, Fan Lai, Chenfeng Xu

2026-04-14


Summary

This paper addresses why diffusion language models still trail traditional autoregressive models in output quality, even though diffusion models can generate text much faster by producing many tokens in parallel.

What's the problem?

The core problem is a lack of 'introspective consistency' in diffusion language models. Autoregressive models essentially 'agree' with what they've already written as they build a sentence, while diffusion models often contradict themselves: a diffusion model might generate a phrase, then later add something that doesn't fit with what it already said. The authors quantify this with an 'introspective acceptance rate', which measures whether a model accepts its own previously generated tokens. They trace the inconsistency to how the two model families are trained: causal masking and logit shifting in autoregressive training implicitly enforce this self-agreement, while standard diffusion training does not.
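The acceptance-rate idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the greedy `next_token` interface and the toy model below are hypothetical stand-ins for a real language model.

```python
# Sketch of an "introspective acceptance rate": the fraction of a
# model's own generated tokens that the model would reproduce when
# re-scoring the sequence prefix by prefix.

def introspective_acceptance_rate(next_token, tokens):
    """Fraction of tokens[i] that equal the model's greedy
    prediction given the prefix tokens[:i]."""
    if not tokens:
        return 0.0
    accepted = sum(
        1 for i in range(len(tokens))
        if next_token(tokens[:i]) == tokens[i]
    )
    return accepted / len(tokens)

# Toy deterministic "model": always predicts the successor of the
# last token it saw, so a fully self-consistent sequence scores 1.0.
def toy_next_token(prefix):
    return (prefix[-1] + 1) if prefix else 0

print(introspective_acceptance_rate(toy_next_token, [0, 1, 2, 3]))  # 1.0
print(introspective_acceptance_rate(toy_next_token, [0, 1, 5, 6]))  # 0.75
```

A greedy autoregressive decoder is introspectively consistent by construction (rate 1.0 on its own output); the paper's observation is that diffusion-style parallel decoding often is not.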

What's the solution?

The researchers introduce a new type of diffusion language model called I-DLM, short for Introspective Diffusion Language Model. I-DLM uses a technique called 'introspective strided decoding' (ISD) that lets the model verify its previously generated tokens *while* advancing new ones, all in the same forward pass. This forces the model to stay consistent with itself, mimicking the behavior of autoregressive models. The researchers also built a dedicated inference engine with a stationary-batch scheduler so the system runs efficiently, especially when handling many requests at once.

Why it matters?

This work is important because it closes the quality gap between diffusion and autoregressive language models. I-DLM achieves performance comparable to autoregressive models, but with the speed benefits of diffusion. It also significantly improves the efficiency of serving these models, meaning they can handle more users simultaneously. This is a big step towards making faster, high-quality language models more practical for real-world applications.

Abstract

Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce the Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.