Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann

2024-07-05

Summary

This paper introduces Diffusion Forcing, a training method for models that generate sequences of data, such as video. Each token in a sequence is noised to its own independent level, and a causal model learns to denoise the tokens and predict future ones.

What's the problem?

Existing approaches to sequence generation each offer only part of what is needed. Next-token prediction models can generate sequences of variable length, while full-sequence diffusion models can guide sampling toward desirable trajectories, but neither family offers both. In addition, when sequences of continuous tokens such as video are rolled out past the training horizon, baseline models diverge.

What's the solution?

To address this, the authors introduce Diffusion Forcing, which trains a diffusion model to denoise a set of tokens (pieces of data) where each token has its own independent noise level. Applied to sequences, a causal next-token prediction model learns to generate one or several future tokens without having fully diffused the past ones. This combines the strengths of next-token prediction (variable-length generation) with those of full-sequence diffusion (the ability to guide sampling toward desirable trajectories). It also enables new sampling and guidance schemes that exploit the variable-horizon, causal architecture and lead to marked performance gains in decision-making and planning tasks.
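The per-token noising step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `cosine_alpha_bar` and `noise_sequence`, the cosine-shaped schedule, and the uniform sampling of noise levels are all assumptions made for the example.

```python
import math
import random

def cosine_alpha_bar(k: int, num_levels: int) -> float:
    """Fraction of signal kept at noise level k (0 = clean, num_levels = pure noise).

    Assumption: a cosine-shaped schedule, chosen only for illustration.
    """
    return math.cos(0.5 * math.pi * k / num_levels) ** 2

def noise_sequence(tokens, num_levels, rng):
    """Noise each token at its own independently sampled level.

    tokens: list of equal-length lists of floats (one vector per time step).
    Returns (noisy_tokens, levels, noises). A denoising model would be
    trained to recover the clean tokens (or the noise) from noisy_tokens
    and levels; the model itself is out of scope for this sketch.
    """
    noisy, levels, noises = [], [], []
    for x in tokens:
        k = rng.randint(0, num_levels)  # independent noise level for this token
        a = cosine_alpha_bar(k, num_levels)
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        noisy.append([math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei
                      for xi, ei in zip(x, eps)])
        levels.append(k)
        noises.append(eps)
    return noisy, levels, noises

if __name__ == "__main__":
    rng = random.Random(0)
    sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    noisy, levels, _ = noise_sequence(sequence, num_levels=10, rng=rng)
    for k, y in zip(levels, noisy):
        print(k, y)
```

The point of the sketch is that the noise level is drawn once per token rather than once per sequence, so a single training example can contain nearly clean tokens next to heavily noised ones.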

Why does it matter?

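This research matters because it gives a single model two capabilities that previously required choosing between model families: variable-length generation and guided sampling. According to the authors, the method can roll out continuous sequences such as video past the training horizon, where baselines diverge, and its sampling and guidance schemes yield marked gains in decision-making and planning tasks. This makes it relevant to fields such as video generation and robotics, where models must both understand and generate sequences.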

Abstract

This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling-out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution. Project website: https://boyuan.space/diffusion-forcing/
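One way to picture generating future tokens "without fully diffusing past ones" is a table of noise levels with one row per sampling step and one column per token. The staggered schedule below is an illustrative assumption, not a scheme taken from the paper, and `staggered_schedule` and its `delay` parameter are hypothetical names.

```python
def staggered_schedule(num_tokens: int, num_levels: int, delay: int):
    """Return rows of per-token noise levels, one row per sampling step.

    Every token starts at num_levels (pure noise) and ends at 0 (clean).
    Token t starts denoising `delay` steps after token t - 1, so later
    tokens are never less noisy than earlier ones.
    """
    total_steps = num_levels + (num_tokens - 1) * delay
    rows = []
    for s in range(total_steps + 1):
        rows.append([
            min(num_levels, max(0, num_levels - s + t * delay))
            for t in range(num_tokens)
        ])
    return rows

if __name__ == "__main__":
    for row in staggered_schedule(num_tokens=3, num_levels=4, delay=2):
        print(row)
```

Under these assumptions, `delay=0` denoises all tokens in lockstep, which resembles full-sequence diffusion, while `delay=num_levels` finishes each token before the next one starts, which resembles one-token-at-a-time generation. Intermediate values keep future tokens partially noised while earlier ones are refined.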