From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, Yunhe Wang
2025-12-10
Summary
This paper introduces a new method for efficiently creating Diffusion Language Models (DLMs) by adapting existing, powerful Large Language Models (LLMs). The goal is to build DLMs, which can generate text in parallel and therefore faster, without having to train them from scratch.
What's the problem?
Currently, LLMs are really good at creating text, but they do it one word at a time, which is slow. DLMs offer a faster, parallel approach to text generation, but building these large models from scratch is incredibly expensive and doesn't take advantage of the knowledge already built into existing LLMs. Previous attempts to adapt existing LLMs to the DLM format haven't fully addressed the differences in how these two types of models process information – LLMs look at text in a specific order, while DLMs can look at chunks of text more freely.
What's the solution?
The researchers observed that an LLM is essentially a block-diffusion model whose 'chunks' are only one word long. They therefore transform the LLM into a DLM by gradually increasing the size of these chunks. The recipe combines an attention mask that stays causal over the preceding context but is bidirectional within the active chunk, an efficient parallel adaptation procedure, and an auxiliary autoregressive loss that preserves what the model already learned. The resulting models are called NBDiff-7B (Base and Instruct).
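The context-causal attention pattern can be pictured as a boolean mask over token pairs. The sketch below is illustrative only (the function name and NumPy representation are ours, not the paper's implementation): tokens attend causally across chunks but bidirectionally within their own chunk, and setting the chunk size to 1 recovers the ordinary causal mask of an autoregressive LLM.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True if query token i may attend to key token j.

    Tokens are grouped into contiguous blocks of `block_size`. Attention is
    causal across blocks (only earlier blocks are visible) and bidirectional
    within a block. With block_size == 1 this reduces to the standard
    lower-triangular causal mask of autoregressive decoding.
    """
    blk = np.arange(seq_len) // block_size      # block index of each token
    return blk[None, :] <= blk[:, None]         # visible iff key block <= query block

# block_size=1 gives the usual causal (lower-triangular) mask:
ar_mask = block_causal_mask(4, 1)
# block_size=2 adds bidirectional visibility inside each 2-token block:
blk_mask = block_causal_mask(4, 2)
```

Here `blk_mask[0, 1]` is True (token 0 sees token 1, same block) while `blk_mask[1, 2]` is False (block 1 is in the future for token 1), matching the causal-in-context, bidirectional-in-block behavior described above.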
Why it matters?
This work is important because it provides a much more efficient way to build powerful DLMs. Instead of spending huge amounts of money and time training a model from scratch, you can adapt an existing LLM, saving resources and still achieving state-of-the-art performance on various tasks like general knowledge, math, and coding. This makes advanced language models more accessible and practical.
Abstract
Large language models (LLMs) excel at generation, but their dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs)--especially block-wise variants--enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior "adaptation" attempts either modify logits, randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as an intra-paradigm path from AR to block diffusion by viewing AR as block diffusion with block size 1. Concretely, we design the adaptation pathway as follows: a context-causal attention mask (causal over the context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and a gradual increase of the generation block size. The recipe integrates cleanly with masked block diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance among 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Code: https://github.com/YuchuanTian/NBDiff.
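The gradual increase of the generation block size can be pictured as a staged schedule over training steps. The milestones and sizes below are hypothetical placeholders for illustration; the paper's actual schedule is not specified here.

```python
def block_size_schedule(step: int,
                        milestones=(1_000, 3_000, 6_000, 10_000),
                        sizes=(1, 2, 4, 8, 16)) -> int:
    """Illustrative staged schedule for the generation block size.

    Training starts at block size 1 (pure AR, so the pretrained checkpoint is
    a valid starting point) and the block size grows each time a milestone is
    passed, ending at the target DLM block size. Milestones/sizes are
    hypothetical, not the paper's actual values.
    """
    stage = sum(step >= m for m in milestones)  # number of milestones passed
    return sizes[stage]
```

For example, `block_size_schedule(0)` returns 1 (AR behavior at the start of adaptation), and the returned block size doubles as each milestone is crossed, reaching 16 after the final one.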