LLaDA2.1: Speeding Up Text Diffusion via Token Editing
Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, Zhuocheng Gong, Yanmei Gu, Jian Guan, Kaiyuan Guan, Hongliang He, Zenan Huang, Juyong Jiang, Zhonghui Jiang, Zhenzhong Lan, Chengxi Li, Jianguo Li, Zehuan Li
2026-02-10
Summary
This paper introduces LLaDA2.1, an improved version of LLaDA2.0, a diffusion-based large language model (dLLM), with the goal of making text generation both faster and higher quality.
What's the problem?
Diffusion language models like LLaDA2.0 can decode many tokens in parallel, but they faced a persistent trade-off: improving the quality of the generated text meant decoding more conservatively, and therefore more slowly, while decoding aggressively for speed hurt quality. It was hard to get both high quality and fast generation at the same time.
What's the solution?
The researchers solved this by combining two decoding mechanisms. They keep the standard Mask-to-Token (M2T) decoding used by diffusion LLMs, in which masked positions are filled once the model is confident enough, and add a Token-to-Token (T2T) editing step that can revise tokens after they have been decoded. Tuning the confidence thresholds of these two mechanisms yields two modes: 'Speedy Mode' (S Mode) lowers the M2T threshold to decode many tokens per step and relies on T2T editing to refine the output, while 'Quality Mode' (Q Mode) keeps conservative thresholds to prioritize accuracy (see the sketch below). They also applied large-scale Reinforcement Learning, tailored to diffusion LLMs, to sharpen the model's reasoning and instruction following, and they expanded the context window so the model can consider more text at once. Finally, they released two versions of the model: LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B).
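The summary above describes the mechanism only in prose; the following is a minimal sketch of what one joint M2T/T2T threshold-decoding step might look like. Everything here (the function name, the model interface returning per-position logits, the forward-progress fallback, and the exact edit criterion) is an illustrative assumption, not the paper's implementation.

```python
import torch

def joint_threshold_step(model, tokens, mask_id,
                         m2t_threshold=0.9, t2t_threshold=0.9):
    """One sketched decoding step: fill confident masked positions (M2T)
    and edit already-decoded tokens the model now disagrees with (T2T).
    `model(tokens)` is assumed to return logits of shape (seq_len, vocab)."""
    probs = torch.softmax(model(tokens), dim=-1)
    conf, pred = probs.max(dim=-1)           # per-position confidence / argmax
    masked = tokens == mask_id

    # M2T: commit masked positions whose confidence clears the threshold.
    # S Mode would set m2t_threshold low (many commits per step);
    # Q Mode sets it high (few, conservative commits).
    fill = masked & (conf >= m2t_threshold)
    if masked.any() and not fill.any():      # guarantee forward progress
        scores = torch.where(masked, conf, torch.full_like(conf, -1.0))
        fill[scores.argmax()] = True

    # T2T: revisit committed tokens and rewrite those where the model now
    # prefers a different token with enough confidence, repairing drafts.
    edit = ~masked & (pred != tokens) & (conf >= t2t_threshold)

    out = tokens.clone()
    out[fill] = pred[fill]
    out[edit] = pred[edit]
    return out
```

The intuition is that a low M2T threshold commits many uncertain tokens per step, and the T2T pass gives the model a later chance to repair them, so speed no longer has to come at the cost of unrecoverable quality loss.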
Why it matters?
This work is important because it demonstrates a way to build very large language models that don't force a choice between speed and quality. The resulting models, especially the 100B version, are extremely fast at coding tasks, generating almost 900 tokens per second (TPS) on HumanEval+, while still performing well across a wide range of other benchmarks. This makes them more practical for real-world applications where both speed and accuracy are crucial.
Abstract
While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural innovation gives rise to two distinct personas: the Speedy Mode (S Mode), which audaciously lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which leans into conservative thresholds to secure superior benchmark performance with manageable efficiency degradation. Furthering this evolution, underpinned by an expansive context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for dLLMs, anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also elevates instruction-following fidelity, bridging the chasm between diffusion dynamics and complex human intent. We culminate this work by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and lightning-fast decoding speed. Despite its 100B parameter scale, on coding tasks it attains an astounding 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
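To make the two personas concrete, here is a hedged sketch of how S Mode and Q Mode could be expressed purely as threshold presets over the step function sketched earlier, driving a simple decode loop. The preset values, the loop structure, and the flat (non-block) sequence layout are all illustrative assumptions; the actual LLaDA2.1 decoder operates block by block, and its threshold values are not stated here.

```python
import torch

# Hypothetical presets; the paper's actual threshold values are not given.
S_MODE = dict(m2t_threshold=0.3, t2t_threshold=0.5)    # fast, edit-reliant
Q_MODE = dict(m2t_threshold=0.95, t2t_threshold=0.95)  # conservative

def decode(model, prompt_ids, gen_len, mask_id, mode=Q_MODE, max_steps=128):
    """Run joint_threshold_step (from the earlier sketch) until no masked
    positions remain, keeping the prompt frozen throughout."""
    tokens = torch.cat([prompt_ids,
                        torch.full((gen_len,), mask_id, dtype=torch.long)])
    for _ in range(max_steps):
        tokens = joint_threshold_step(model, tokens, mask_id, **mode)
        tokens[:len(prompt_ids)] = prompt_ids  # never edit the prompt
        if not (tokens == mask_id).any():
            break
    return tokens[len(prompt_ids):]
```

Under this framing, switching between S Mode and Q Mode is a pure inference-time configuration change: the same weights serve both fast drafting-plus-editing and conservative high-quality decoding.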