Terminal Velocity Matching
Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song
2025-11-27
Summary
This paper introduces a technique called Terminal Velocity Matching (TVM) for diffusion models, a type of AI that generates data by gradually removing noise. TVM improves both how quickly and how faithfully these models can create high-quality images.
What's the problem?
Existing diffusion models often require many steps to generate a clear image, making them slow. Moreover, some of the most powerful models, called Diffusion Transformers, aren't stable when trained to generate images in just a few steps. The core issue is controlling how the model changes the image during the noise-removal process: specifically, ensuring it behaves predictably at the *end* of each denoising step, not just at the beginning.
What's the solution?
TVM tackles this by modeling the transition between any two points in the denoising process and regularizing the model's behavior at the *terminal* time of that transition, its 'terminal velocity'. The researchers prove that if the model's output changes smoothly with its input (a property called Lipschitz continuity), the distribution of generated images stays provably close to the distribution of real images. Because Diffusion Transformers don't naturally have this property, the authors make small changes to the architecture that keep training stable in a single stage. They also built a 'fused attention kernel', a faster way to compute the derivative information (Jacobian-vector products and their gradients) the method needs, so training remains efficient at transformer scale.
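To make that last point concrete, here is a minimal JAX sketch of the computation pattern the fused kernel targets: a backward pass through a Jacobian-vector product (JVP) of an attention layer. This uses plain autodiff; the `attention` function and all shapes are illustrative, not the paper's implementation.

```python
# A minimal sketch of a backward pass through a JVP of attention,
# the pattern a fused attention kernel can accelerate at scale.
import jax
import jax.numpy as jnp

def attention(x, w_q, w_k, w_v):
    # Single-head self-attention over a sequence of token embeddings.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def jvp_norm(params, x, tangent):
    # Push a tangent direction through attention with jax.jvp, then reduce
    # to a scalar so we can differentiate through the JVP itself.
    w_q, w_k, w_v = params
    _, out_tangent = jax.jvp(
        lambda xi: attention(xi, w_q, w_k, w_v), (x,), (tangent,)
    )
    return jnp.sum(out_tangent ** 2)

keys = jax.random.split(jax.random.PRNGKey(0), 5)
x = jax.random.normal(keys[0], (16, 64))        # 16 tokens, width 64
tangent = jax.random.normal(keys[1], (16, 64))  # direction for the JVP
params = tuple(jax.random.normal(k, (64, 64)) for k in keys[2:])

# Gradient of a JVP: the nested differentiation that naive autodiff makes
# memory-hungry and that a fused kernel computes efficiently.
grads = jax.grad(jvp_norm)(params, x, tangent)
```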
Why does it matter?
This research matters because it enables much faster image generation without sacrificing quality. TVM achieves state-of-the-art results, producing highly realistic images in just one or a few steps, a significant improvement over previous methods. This could make AI image creation faster and more efficient for applications like art, design, and scientific visualization.
Abstract
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the 2-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
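For context on what TVM generalizes, below is a minimal JAX sketch of the standard flow matching objective (linear interpolation path, velocity target). It is not the TVM loss itself; TVM additionally models transitions between pairs of timesteps and regularizes behavior at the terminal time. The `net` here is a toy stand-in for a real network.

```python
# Minimal flow matching baseline: the objective that TVM generalizes.
# NOT the TVM loss; TVM also conditions on a pair of timesteps and
# regularizes the model's behavior at the terminal time.
import jax
import jax.numpy as jnp

def net(params, x_t, t):
    # Toy linear "velocity network", a stand-in for a Diffusion Transformer.
    return x_t @ params["w"] + t * params["b"]

def flow_matching_loss(params, x0, eps, t):
    # Linear interpolation path: x_t = (1 - t) * x0 + t * eps,
    # with target velocity dx_t/dt = eps - x0.
    t = t[:, None]                      # broadcast over the feature dim
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0
    v_pred = net(params, x_t, t)
    return jnp.mean((v_pred - v_target) ** 2)

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x0 = jax.random.normal(k1, (8, 32))     # batch of 8 "data" vectors
eps = jax.random.normal(k2, (8, 32))    # matched Gaussian noise
t = jax.random.uniform(k3, (8,))        # timesteps sampled in [0, 1]
params = {"w": jnp.eye(32), "b": jnp.zeros((32,))}

loss, grads = jax.value_and_grad(flow_matching_loss)(params, x0, eps, t)
```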