Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

2025-12-17

Summary

This paper explores how to make diffusion language models, which generate text quickly, match the task accuracy of traditional autoregressive language models without giving up their speed advantage.

What's the problem?

Diffusion language models can generate text very quickly because they predict many tokens in parallel rather than building text up one word at a time. However, when trained from scratch, they fall short of autoregressive models on complex language tasks. The core challenge is how to take a powerful, already-trained autoregressive model and efficiently convert it into a fast diffusion model without losing its ability to accurately understand and generate text.

What's the solution?

The researchers found that the key to a good conversion is preserving how the original autoregressive model 'pays attention' to different parts of the text. They developed a training scheme in which the model looks at text block by block: within each block, words can attend to each other in both directions, while the left-to-right order between blocks is kept causal. They also changed how words are hidden during training, masking later tokens more often so that training better matches how the model fills in text at generation time. Together, these changes help the diffusion model learn more effectively from the original autoregressive model.
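The block-wise pattern described above can be sketched as a simple attention mask. This is an illustrative reconstruction, not the paper's implementation: positions attend bidirectionally to tokens in their own block and causally to all earlier blocks, but never to future blocks.

```python
import numpy as np

def blockwise_attention_mask(seq_len, block_size):
    """Boolean mask: entry (i, j) is True if query position i may attend
    to key position j.

    Tokens attend bidirectionally within their own block and to all tokens
    in earlier blocks; future blocks are never visible (causal across blocks).
    """
    blocks = np.arange(seq_len) // block_size  # block index of each position
    # attend iff the key's block is at or before the query's block
    return blocks[:, None] >= blocks[None, :]

# Example: 6 tokens, blocks of 2 -> blocks [0, 0, 1, 1, 2, 2]
mask = blockwise_attention_mask(seq_len=6, block_size=2)
```

With a block size of 1 this mask degenerates to standard causal attention, which is why the pattern stays close to the pretrained autoregressive model while still enabling parallel decoding (and KV caching) inside each block.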

Why it matters?

This work is important because it provides a practical way to get the best of both worlds: the speed of diffusion language models and the accuracy of autoregressive models. The resulting Efficient-DLM family beats existing models on both axes; for example, the 8B model is reported to be more accurate and several times faster in throughput than prior models such as Dream 7B and Qwen3 4B. This could lead to improvements in things like chatbots, translation tools, and content creation.

Abstract

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
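The position-dependent token masking in the abstract can be illustrated with a small sketch. The linear ramp used here is an assumption for illustration; the paper's exact masking schedule may differ. The idea it captures is the stated one: later positions get higher masking probabilities during training, mimicking the roughly left-to-right order in which masks are resolved at test time.

```python
import numpy as np

def position_dependent_mask(seq_len, p_min=0.1, p_max=0.9, rng=None):
    """Sample a training mask whose per-position masking probability
    increases with position, so later tokens are masked more often.

    The linear schedule from p_min to p_max is a hypothetical choice
    used only to illustrate the position-dependent idea.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.linspace(p_min, p_max, seq_len)  # later tokens: higher prob
    mask = rng.random(seq_len) < probs          # True = token is masked
    return mask, probs

mask, probs = position_dependent_mask(seq_len=8, rng=np.random.default_rng(0))
```

Compared with uniform masking, this biases training batches toward the "prefix mostly visible, suffix mostly masked" configurations the model actually encounters when generating text.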