ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
2025-12-16
Summary
This paper introduces ReFusion, a new type of model for generating text that aims to combine the strengths of two existing approaches – autoregressive models and masked diffusion models – while overcoming their weaknesses.
What's the problem?
Currently, text generation has two main approaches, each with drawbacks. Autoregressive models generate text one word at a time, which is accurate but slow. Masked diffusion models can generate many words in parallel, making them faster, but they cannot reuse cached computations (the Key-Value cache), so each step needs far more computing power, and they sometimes produce incoherent text because learning the dependencies over every possible combination of words is intractable.
What's the solution?
ReFusion solves this by working with fixed-length 'slots' of contiguous words instead of individual words. In each round, a diffusion-based planning step first identifies which slots are only weakly dependent on each other and can therefore be generated independently; an autoregressive infilling step then decodes those selected slots in parallel, filling each slot word by word. Because every slot is decoded autoregressively, the model can fully reuse its Key-Value cache, allowing faster generation with better coherence and efficient use of computing resources.
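The iterative plan-and-infill loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's algorithm: the function names (`plan_slots`, `infill_slot`, `refusion_decode`), the slot length, the every-other-slot planning heuristic, and the dummy tokens are all invented stand-ins for the real model components.

```python
# Toy sketch of a ReFusion-style "plan-and-infill" decoding loop.
# All names and heuristics here are illustrative placeholders.

SLOT_LEN = 4       # assumed fixed slot length
MASK = "<mask>"    # placeholder for an undecoded token

def plan_slots(seq, num_slots):
    """Planning step (stand-in): choose masked slots that are weakly
    dependent, so they can be filled simultaneously. The real model
    uses a diffusion-based planner; here we just pick every other
    remaining masked slot as a toy heuristic."""
    masked = [s for s in range(num_slots) if seq[s * SLOT_LEN] == MASK]
    return masked[::2] or masked

def infill_slot(seq, slot_idx):
    """Infilling step (stand-in): decode one slot's tokens left to
    right, i.e. autoregressively within the slot. The real model
    conditions on the whole partially decoded sequence; here we
    emit dummy tokens."""
    start = slot_idx * SLOT_LEN
    for t in range(SLOT_LEN):
        seq[start + t] = f"tok{slot_idx}.{t}"

def refusion_decode(num_slots):
    """Iterate planning and parallel infilling until no masks remain."""
    seq = [MASK] * (num_slots * SLOT_LEN)
    while MASK in seq:
        for s in plan_slots(seq, num_slots):  # in practice: in parallel
            infill_slot(seq, s)
    return seq
```

For example, `refusion_decode(4)` fills four slots in two rounds (slots 0 and 2, then slots 1 and 3), mimicking how weakly dependent slots are decoded together while the rest wait for the next planning round.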
Why it matters?
ReFusion is important because it significantly improves the speed and quality of text generation. It's faster than previous parallel methods and performs almost as well as the slower, more accurate methods, offering a good balance between speed and quality. This could lead to improvements in applications like chatbots, content creation, and machine translation.
Abstract
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
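To make the contrast between the token combination space and the slot-level permutation space concrete, here is one crude back-of-the-envelope comparison: counting the possible orders in which units can be revealed during decoding. The sequence length and slot length below are illustrative choices, not values from the paper.

```python
from math import factorial

L, slot_len = 16, 4          # illustrative sizes, not from the paper
num_slots = L // slot_len    # -> 4 slots

# Token-level: a crude count of the orders in which L individual
# tokens could be revealed during parallel decoding.
token_orderings = factorial(L)       # 16! = 20_922_789_888_000

# Slot-level: ReFusion's design only needs to handle orderings over
# slots, a far smaller permutation space.
slot_orderings = factorial(num_slots)  # 4! = 24

print(token_orderings, slot_orderings)
```

Even at this toy scale the gap is roughly twelve orders of magnitude, which is the intuition behind calling the slot-level permutation space "manageable".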