
Planned Diffusion

Daniel Israel, Tian Jin, Ellie Cheng, Guy Van den Broeck, Aditya Grover, Suvinay Subramanian, Michael Carbin

2025-10-22


Summary

This paper introduces a new method called 'planned diffusion' for making large language models generate text faster without significantly sacrificing the quality of the output.

What's the problem?

Currently, there's a big challenge in getting large language models to generate text quickly *and* well. Traditional methods that produce high-quality text do so slowly, creating one word at a time. Other methods can generate text faster by working on multiple parts at once, but the quality isn't as good. It's a trade-off between speed and quality, and finding a good balance is difficult.

What's the solution?

Planned diffusion tackles this problem by combining the best parts of both approaches. First, the model quickly creates a 'plan' – a short outline breaking down the desired text into smaller, independent sections. Then, it generates all these sections simultaneously using a faster, but usually lower-quality, method. Because the plan guides the generation, the final result is high quality, but the process is much quicker overall.
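The two-stage idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `make_plan` and `fill_span` are hypothetical stand-ins for the autoregressive planner and the diffusion span generator, and a thread pool stands in for the parallel diffusion pass.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the two model calls. In the actual method, an
# autoregressive pass emits the plan and a diffusion pass fills spans.
def make_plan(prompt):
    # Stage 1 (autoregressive, sequential but short): split the desired
    # output into small, independent span specifications.
    return [f"{prompt} - part {i}" for i in range(1, 4)]

def fill_span(span_spec):
    # Stage 2 (diffusion): each span is generated independently,
    # so all spans can run at the same time.
    return span_spec.upper()

def planned_diffusion(prompt):
    plan = make_plan(prompt)
    # Because the plan made the spans independent, they can be
    # filled in concurrently rather than token by token.
    with ThreadPoolExecutor() as pool:
        spans = list(pool.map(fill_span, plan))
    return " ".join(spans)
```

The speedup comes from the second stage: the short sequential plan costs little, while the bulk of the text is produced in parallel.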

Why it matters?

This research is important because it offers a practical way to speed up large language models without a major loss in text quality. The experiments show significant speed improvements – up to 1.81 times faster – with only a small decrease in how well the generated text performs, making these models more useful in real-world applications where quick responses are needed.

Abstract

A central challenge in large language model inference is the trade-off between generation speed and output quality. Autoregressive models produce high-quality text but generate tokens sequentially. Diffusion models can generate tokens in parallel but often need many iterations to match the same quality. We propose planned diffusion, a hybrid method that combines the strengths of both paradigms. Planned diffusion works in two stages: first, the model creates a short autoregressive plan that breaks the output into smaller, independent spans. Second, the model generates these spans simultaneously using diffusion. This approach expands the speed-quality Pareto frontier and provides a practical path to faster, high-quality text generation. On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal trade-off between quality and latency, achieving 1.27x to 1.81x speedup over autoregressive generation with only a 0.87% to 5.4% drop in win rate, respectively. Our sensitivity analysis shows that the planning mechanism of planned diffusion is minimal and reliable, and simple runtime knobs exist to provide flexible control of the quality-latency trade-off.