Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
Jonas Geiping, Xinyu Yang, Guinan Su
2025-10-17
Summary
This paper explores a new way to speed up text generation from language models with 'recurrent depth', meaning models that repeat the same layers multiple times to process information more deeply. It draws a connection between these models and 'diffusion language models', and borrows ideas from the diffusion literature to make text generation substantially faster.
What's the problem?
Modern language models are getting very capable, but they can be slow to generate text, especially recurrent-depth models, which spend extra sequential computation on every token. That extra computation can improve reasoning, but it also slows generation down. The challenge is to exploit the extra processing without paying the full sequential cost at inference time.
What's the solution?
The researchers developed a new 'sampler', essentially a procedure for how the model generates text. The sampler exploits the recurrent structure of the model: at every forward pass it starts decoding a new token while simultaneously continuing to refine the latent states of earlier, still-unfinished tokens. Because many positions are refined in parallel within a single pass, generation becomes significantly faster, with speedups of up to 5x on an existing 3.5B-parameter recurrent-depth transformer, and the method requires no fine-tuning or other changes to the model.
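The idea above can be illustrated with a toy sketch. This is not the paper's implementation; the model components (`embed`, `recur`, `head`) and the fixed recurrence budget `MAX_DEPTH` are hypothetical stand-ins for a recurrent-depth transformer. The key point it demonstrates is the scheduling: each pass opens one new position and advances every in-flight position by one recurrence step, so neighboring tokens are refined in parallel and a token is committed once its latent has been refined enough times.

```python
import torch

# Hypothetical stand-ins for a recurrent-depth model's components
# (a real model like the paper's 3.5B transformer has different interfaces).
torch.manual_seed(0)
VOCAB, DIM, MAX_DEPTH = 100, 32, 4

embed = torch.nn.Embedding(VOCAB, DIM)
recur = torch.nn.Linear(DIM, DIM)    # one recurrence step over a latent state
head = torch.nn.Linear(DIM, VOCAB)   # decode a refined latent to token logits

def diffusion_forcing_sample(prompt_ids, n_new):
    """Sketch of a diffusion-forcing-style schedule: every forward pass
    opens one new position AND refines all in-flight latents by one
    recurrence step, so refinement happens in parallel across positions."""
    finished = list(prompt_ids)
    window = []  # in-flight positions as [latent, depth] pairs
    with torch.no_grad():
        while len(finished) < len(prompt_ids) + n_new:
            # Open a new position, initialized from the last committed token.
            window.append([embed(torch.tensor(finished[-1])), 0])
            # One pass refines every in-flight latent by one step; in a real
            # model this loop is a single batched forward pass.
            for slot in window:
                slot[0] = torch.tanh(recur(slot[0]))
                slot[1] += 1
            # Commit tokens whose latents have used up the recurrence budget.
            while window and window[0][1] >= MAX_DEPTH:
                latent, _ = window.pop(0)
                finished.append(int(head(latent).argmax()))
    return finished[len(prompt_ids):]
```

After a warm-up of `MAX_DEPTH` passes, the sketch commits roughly one token per forward pass even though each token receives `MAX_DEPTH` refinement steps, which is the source of the speedup over running the full recurrence sequentially for each token.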
Why it matters?
This work is important because it provides a way to efficiently use the extra computational power of these advanced language models. It speeds up text generation, potentially making these powerful models more practical for real-world applications. It also suggests a new way to think about these models, viewing them as a type of diffusion model, which could lead to further improvements in the future.
Abstract
Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.