Flowception: Temporally Expansive Flow Matching for Video Generation

Tariq Berrada Ifriqi, John Nguyen, Karteek Alahari, Jakob Verbeek, Ricky T. Q. Chen

2025-12-16

Summary

This paper introduces Flowception, a new AI method for generating videos that does not build them frame by frame in a rigid order, which lets it produce longer, more consistent videos of varying lengths.

What's the problem?

Existing methods for generating videos often struggle with creating long, consistent videos. Some methods build videos one frame at a time, which can lead to errors building up over time and the video drifting from its intended content. Other methods are computationally expensive, requiring a lot of processing power and time to train, and don't easily allow for varying video lengths.

What's the solution?

Flowception tackles these issues by learning a generation 'path' that interleaves two kinds of steps: inserting new frames into the sequence and progressively refining (denoising) the frames already present. Because frames are not generated strictly one after another, errors are less likely to accumulate over the course of the video. The approach also cuts the training computation roughly three-fold compared to full-sequence methods, and it learns the appropriate length of the video jointly with its content during generation.
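The interleaving of discrete frame insertions with continuous denoising steps can be sketched as a simple sampling loop. The velocity field, the insertion policy, and all shapes below are illustrative stand-ins (the paper learns both the velocity model and when to insert frames); this is a minimal toy, not the paper's implementation.

```python
import numpy as np

def velocity(frames, t):
    # Toy flow-matching velocity field: pushes every frame toward zero.
    # Stand-in for the learned network that predicts dx/dt at time t.
    return -np.stack(frames)

def sample_video(max_frames=4, frame_shape=(2, 2), n_steps=8, seed=0):
    rng = np.random.default_rng(seed)
    frames = [rng.standard_normal(frame_shape)]  # start from one noisy frame
    dt = 1.0 / n_steps
    t = 0.0
    for step in range(n_steps):
        # Discrete step: insert a fresh noisy frame on a fixed cadence.
        # (In Flowception this insertion schedule is learned, so video
        # length emerges jointly with content; here it is hard-coded.)
        if len(frames) < max_frames and step % 2 == 0:
            frames.append(rng.standard_normal(frame_shape))
        # Continuous step: one Euler update of the flow ODE on all frames.
        v = velocity(frames, t)
        frames = [f + dt * v[i] for i, f in enumerate(frames)]
        t += dt
    return np.stack(frames)

video = sample_video()
print(video.shape)  # (num_frames, H, W)
```

Because newly inserted frames enter mid-trajectory and are denoised alongside their neighbors, the whole sequence stays jointly consistent rather than being conditioned only on previously finished frames.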

Why it matters?

This research is important because it offers a more efficient and effective way to generate high-quality videos. The improved speed and quality open doors for applications like turning images into videos or filling in missing frames in existing footage, and it provides a foundation for more advanced video generation techniques.

Abstract

We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift, as the frame insertion mechanism during sampling serves as an efficient compression mechanism for handling long-term context. Compared to full-sequence flows, our method reduces training FLOPs three-fold, while also being more amenable to local attention variants and allowing the length of videos to be learned jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.