Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
2025-12-18
Summary
This paper focuses on making large language models, specifically those based on the transformer architecture, run much faster. It tackles how these models generate text, replacing the standard one-word-at-a-time decoding with a faster parallel scheme while keeping output quality.
What's the problem?
Currently, a popular way to speed things up is to convert models that normally generate text one word at a time (like how we write) into models that can generate multiple words at once, for example by turning them into diffusion-style language models. However, this conversion often hurts quality or limits the speedup: the masked training data the model sees *after* the conversion looks very different from the real text it saw during pretraining, and the converted models read text in both directions at once, which clashes with the left-to-right (causal) habits the original model learned and prevents it from reusing previously computed results (the KV cache) to speed things up.
What's the solution?
The researchers introduce a technique called 'Jacobi Forcing'. The model is trained on its own parallel-decoding attempts: its intermediate guesses serve as training data, so it learns from its own attempts and gradually shifts toward parallel generation while keeping its original, left-to-right understanding of language. They also developed a decoding method called 'multi-block decoding with rejection recycling', which accepts more generated words per step at the cost of a bit more computation, further reducing latency.
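To make the idea concrete, here is a minimal, illustrative sketch of the Jacobi-style parallel decoding such a model performs at inference time: a block of draft tokens is repeatedly refined by a single parallel forward pass until it stops changing, at which point it matches what greedy one-token-at-a-time decoding would have produced. The model interface (a Hugging Face-style causal LM returning `.logits`), the block size, and the stopping rule are assumptions for illustration; the paper's actual decoder additionally handles KV caching, multiple blocks, and rejection recycling.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, block_size=16, max_iters=32, pad_id=0):
    """Refine a block of draft tokens in parallel until it reaches the fixed
    point of greedy decoding, then return the converged block.

    prefix_ids: (1, P) tensor of already-generated token ids.
    """
    device = prefix_ids.device
    # Start from an arbitrary draft (here: pad tokens); any initialization works.
    draft = torch.full((1, block_size), pad_id, dtype=torch.long, device=device)
    for _ in range(max_iters):
        # One parallel forward pass over prefix + current draft (causal attention).
        input_ids = torch.cat([prefix_ids, draft], dim=1)
        logits = model(input_ids).logits
        # Greedy prediction for draft position i is conditioned on prefix + draft[:i].
        new_draft = logits[:, prefix_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_draft, draft):
            break  # fixed point: the block matches greedy autoregressive output
        draft = new_draft
    return draft
```

The speedup comes from the number of refinement iterations being much smaller than the block size once the model has been fine-tuned on its own Jacobi trajectories; longer outputs are produced by decoding several such blocks in sequence.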
Why it matters?
This work is important because it significantly speeds up large language models – achieving nearly 4 times faster generation, with up to 4.5 times more tokens accepted per decoding step – without sacrificing the quality of the text they produce. Faster models mean quicker responses in applications like chatbots, code generation, and more, making these powerful tools more practical and accessible.
Abstract
Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve autoregressive (AR)-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
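The abstract describes training on self-generated parallel decoding trajectories. Below is an illustrative, consistency-style sketch of what such a training signal could look like: intermediate (unconverged) draft blocks collected during Jacobi decoding are pushed toward the converged fixed-point block. This is not the paper's exact objective; the progressive distillation schedule, loss weighting, and any auxiliary autoregressive loss are omitted, and the function name and interface are assumptions.

```python
import torch
import torch.nn.functional as F

def trajectory_consistency_loss(model, prefix_ids, trajectory, converged_block):
    """`trajectory`: list of intermediate draft blocks of shape (1, B) collected
    while Jacobi-decoding one block; `converged_block`: its (1, B) fixed point.
    """
    losses = []
    for draft in trajectory:
        input_ids = torch.cat([prefix_ids, draft], dim=1)
        logits = model(input_ids).logits
        # Logits for the B draft positions (same slicing as at decoding time).
        block_logits = logits[:, prefix_ids.shape[1] - 1 : -1, :]
        # Push predictions made from this intermediate state toward the converged block.
        losses.append(
            F.cross_entropy(
                block_logits.reshape(-1, block_logits.size(-1)),
                converged_block.reshape(-1),
            )
        )
    return torch.stack(losses).mean()
```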