FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh

2026-01-02

Summary

This paper focuses on making video generation models faster without losing quality. It observes that the amount of model capacity needed changes over the course of the generation process.

What's the problem?

Generating videos with advanced AI models is computationally expensive and slow. Existing methods apply the same large model throughout the entire sampling process, even at steps where a smaller model would suffice, wasting processing power and time. The paper identifies that a large, high-capacity model is really needed only at the beginning and end of video generation, while the intermediate stage is far less sensitive to model capacity.

What's the solution?

The researchers developed a technique called FlowBlending. This method switches between a large, complex model and a smaller, faster model during video generation: it uses the large model when capacity matters most, at the start and finish, and the smaller model during the less critical middle stage. They also introduce simple criteria for choosing where to switch, based on an analysis of how the two models' predicted velocities diverge across timesteps. A code sketch of the idea follows.
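To make the mechanism concrete, here is a minimal sketch of what stage-aware multi-model sampling could look like. Everything in it is an assumption for illustration: the shared model interface, the plain Euler sampler, and the fixed boundaries `t_lo` and `t_hi` are placeholders, not the paper's actual implementation (which selects the boundaries with its own criteria).

```python
import torch

# Minimal sketch of stage-aware multi-model sampling (hypothetical API).
# `large_model` and `small_model` are assumed to share one interface:
# given the current latent `x` and timestep `t`, each returns a velocity.

def flowblending_sample(large_model, small_model, x, num_steps=50,
                        t_lo=0.2, t_hi=0.8):
    """Euler sampling over t in [0, 1], switching models by stage.

    t_lo and t_hi are illustrative stage boundaries, not the paper's values.
    """
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # Large model in the capacity-sensitive early and late stages,
        # small model in the intermediate stage.
        model = large_model if (t < t_lo or t >= t_hi) else small_model
        with torch.no_grad():
            v = model(x, t)  # predicted velocity field at this step
        x = x + v * dt       # Euler integration step
    return x
```

The savings come from the middle of the loop: every intermediate step runs the cheap model, so total cost drops roughly in proportion to how wide the intermediate stage is.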

Why it matters?

FlowBlending significantly speeds up video generation, making it up to 1.65 times faster while using considerably less compute (57.35% fewer FLOPs). Importantly, it does this without sacrificing the visual quality, smoothness, or overall meaning of the generated videos. This makes advanced video generation more accessible and practical, and the method can be combined with other speed-up techniques for even greater gains.

Abstract

In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
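For intuition on the velocity-divergence analysis mentioned in the abstract, here is a hedged sketch. It assumes the proxy compares the large and small models' predicted velocities at each timestep; the function name and the relative-L2 metric are illustrative choices, and the paper's exact formulation may differ.

```python
import torch

# Illustrative velocity-divergence proxy (assumed formulation):
# timesteps where the two models' predicted velocities disagree most
# are treated as capacity-sensitive, so the large model is kept there.

def velocity_divergence(large_model, small_model, x, t):
    """Relative disagreement between the two models' velocity fields."""
    with torch.no_grad():
        v_large = large_model(x, t)
        v_small = small_model(x, t)
    # Relative L2 distance: large values flag timesteps where the small
    # model is a poor substitute for the large one.
    return (v_large - v_small).norm() / v_large.norm().clamp_min(1e-8)
```

Under this reading, stage boundaries fall where the divergence curve drops below a threshold on the way into the intermediate stage and rises above it again near the end.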