LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Mingcong Song, Hongjie Si, Tianqi Hou, Junchi Yan, Zhijie Deng

2025-12-23

Summary

This paper focuses on making diffusion large language models (dLLMs), a class of models that generate text by iteratively filling in masked tokens, run much faster. The authors speed up inference by improving how the model decides which parts of the output to work on at the same time.

What's the problem?

Currently, these models commit only a few pieces of output (tokens) per forward pass, typically 1 to 3. The bottleneck is the Token Filling Order (TFO): the order in which the blanks get filled in strongly limits how much work can be done in parallel. Existing decoding strategies pick which tokens to commit based on how confident the model is at each step, but that greedy choice does not maximize parallelism.
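To make the bottleneck concrete, here is a minimal toy sketch (assumed numbers and function names, not the paper's implementation) of confidence-driven decoding: in each forward pass, only masked positions whose predicted token clears a confidence threshold are committed, so most positions must wait for later passes.

```python
def count_committed(confidences, threshold=0.9):
    """Count how many masked positions clear the confidence bar this pass."""
    return sum(1 for c in confidences if c >= threshold)

# Hypothetical confidences the model assigns to 8 masked positions in one pass:
pass_confidences = [0.95, 0.91, 0.62, 0.48, 0.88, 0.30, 0.97, 0.55]

print(count_committed(pass_confidences))  # 3 -> only 3 of 8 tokens decoded
```

With a high threshold only a handful of tokens pass per step, which is why confidence-driven decoding typically lands in the 1–3 tokens-per-forward-pass range the paper reports.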

What's the solution?

The researchers developed a new technique called LoPA, which stands for Lookahead Parallel Decoding. LoPA doesn't require any extra training; it can be added to existing models as a plug-and-play decoding algorithm. It works by testing out different orders for filling in the output pieces *simultaneously*, as parallel branches. It then picks the branch whose filling order seems most likely to allow more tokens to be generated in parallel on future passes, judged by branch confidence. To take full advantage of this, they also built a multi-GPU inference system with a scheme called Branch Parallelism, which spreads the branches across devices and pushes single-sample throughput past 1,000 tokens per second.
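The branch-selection idea can be sketched in a few lines. This is a simplified illustration, not the paper's code: the branch ids, the confidence numbers, and the use of mean confidence as the lookahead score are all assumptions standing in for the paper's branch-confidence criterion.

```python
def branch_score(future_confidences):
    # Proxy score (assumption): average confidence over the positions that
    # remain masked after this branch's candidate filling order is applied.
    # Higher means the next pass can likely commit more tokens at once.
    return sum(future_confidences) / len(future_confidences)

def select_branch(branches):
    """branches maps a candidate token-filling order to the confidences of
    the remaining masked positions; return the most promising candidate."""
    return max(branches, key=lambda name: branch_score(branches[name]))

# Three hypothetical candidate filling orders explored in parallel:
branches = {
    "tfo_a": [0.42, 0.55, 0.61],
    "tfo_b": [0.90, 0.88, 0.93],  # committing these tokens first unlocks the rest
    "tfo_c": [0.70, 0.35, 0.80],
}

print(select_branch(branches))  # tfo_b
```

The key design point is that the branches are evaluated concurrently in one batched forward pass, so exploring several candidate orders costs extra compute but not extra latency; Branch Parallelism then spreads that batch across GPUs.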

Why it matters?

This work is important because it significantly speeds up diffusion large language models. On a math benchmark, one model went from generating 1–3 tokens per forward pass to over 10, without sacrificing output quality. Faster models mean quicker responses and more efficient use of computing resources, making these powerful AI tools practical for a wider range of applications.

Abstract

Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1--3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.