LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen
2025-12-19
Summary
This paper introduces LLaDA2.0, a new way to build very large language models (LLMs) with up to 100 billion parameters. Instead of training such models from scratch, which is incredibly expensive, the authors convert existing pre-trained autoregressive models into discrete diffusion language models, a format that decodes tokens in parallel and is more efficient at inference time.
What's the problem?
Training extremely large language models from the beginning requires massive amounts of computing power and data, making it very costly and inaccessible to many researchers. Existing methods for scaling up models often sacrifice performance or efficiency. The goal is to create a large, powerful LLM without the huge expense of training from zero.
What's the solution?
The researchers developed a three-phase process to convert a pre-trained autoregressive language model into a 'discrete diffusion large language model' (dLLM). First, they gradually increase the size of the blocks the model denoises during training ('warm-up'). Then, they train with diffusion over full sequences ('stable'). Finally, they shrink the block size back to a compact value ('decay'). They also fine-tuned the models with supervised fine-tuning (SFT) and direct preference optimization (DPO) to make them better at following instructions. This resulted in two Mixture-of-Experts models, LLaDA2.0-mini and LLaDA2.0-flash, which balance quality and efficiency.
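To make the three-phase recipe concrete, the sketch below shows one way the block size could be scheduled over training. The phase names follow the paper (warm-up, stable, decay), but the step counts, block sizes, and function name are illustrative assumptions, not values reported by the authors.

```python
# Hypothetical block-size schedule for the three-phase conversion.
# All concrete numbers here are placeholders for illustration only.

def block_size_schedule(step: int,
                        warmup_steps: int = 10_000,
                        stable_steps: int = 80_000,
                        min_block: int = 32,
                        full_seq: int = 4096,      # "full sequence" in the stable phase
                        decay_block: int = 32) -> int:
    """Return the diffusion block size to use at a given training step."""
    if step < warmup_steps:
        # Warm-up: progressively grow the block size from small toward the full sequence.
        frac = step / warmup_steps
        return int(min_block + frac * (full_seq - min_block))
    if step < warmup_steps + stable_steps:
        # Stable: large-scale full-sequence diffusion training.
        return full_seq
    # Decay: revert to a compact block size for efficient block-diffusion decoding.
    return decay_block
```

In this sketch the block grows toward the full sequence length during warm-up, stays there for the stable phase, and drops back to a compact block in the decay phase so the final model supports efficient block-wise parallel decoding.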
Why it matters?
This work is important because it provides a more affordable and efficient way to create state-of-the-art LLMs. By reusing existing models, they significantly reduce the computational cost and make these powerful tools more accessible. The models are also open-sourced, meaning anyone can use and build upon them, accelerating progress in the field of artificial intelligence.
Abstract
This paper presents LLaDA2.0 -- a family of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
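As a rough illustration of why block diffusion enables parallel decoding, the following sketch fills one masked block by repeatedly predicting all masked positions at once and committing the most confident tokens. The model interface (`model(tokens) -> logits`) and the confidence-based unmasking rule are generic masked-diffusion conventions assumed here; this is not the authors' exact sampler.

```python
import torch

@torch.no_grad()
def decode_block(model, prefix: torch.Tensor, block_len: int,
                 mask_id: int, steps: int = 8) -> torch.Tensor:
    """Fill one block of `block_len` masked tokens after `prefix` (shape [1, L])."""
    block = torch.full((1, block_len), mask_id, dtype=torch.long,
                       device=prefix.device)
    tokens = torch.cat([prefix, block], dim=1)
    per_step = -(-block_len // steps)               # tokens revealed per denoising step

    for _ in range(steps):
        masked = tokens[0, prefix.size(1):] == mask_id
        if not masked.any():
            break
        logits = model(tokens)[0, prefix.size(1):]  # [block_len, vocab]
        conf, pred = logits.softmax(-1).max(-1)
        conf[~masked] = -1.0                        # only consider still-masked slots
        # Commit the most confident predictions in parallel.
        idx = conf.topk(min(per_step, int(masked.sum()))).indices
        tokens[0, prefix.size(1) + idx] = pred[idx]
    return tokens
```

The key point is that each denoising step updates many positions of the block at once, which is where the efficiency advantage over strictly token-by-token AR decoding comes from.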