STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

2025-11-26

Summary

This paper introduces STARFlow-V, a new way to generate videos using a technique called normalizing flows. Normalizing flows are strong generative models, but they have rarely been applied to video because videos are complex and require a lot of computing power. This research shows that normalizing flows *can* produce high-quality videos, competing with the more popular diffusion models.
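To make the core idea concrete, here is a minimal sketch of what a normalizing flow is (this is an illustration of the general technique, not the STARFlow-V model): an invertible transform lets us compute the *exact* likelihood of data via the change-of-variables formula. The affine transform and its parameters below are purely illustrative.

```python
import math

# A normalizing flow maps data x to a simple base variable z through an
# invertible function, so log p(x) = log p_z(z) + log |dz/dx|.
# Here the flow is a single illustrative affine transform x = scale*z + shift.

def forward(z, scale=2.0, shift=1.0):
    """Invertible transform: x = scale * z + shift."""
    return scale * z + shift

def inverse(x, scale=2.0, shift=1.0):
    """Exact inverse: z = (x - shift) / scale."""
    return (x - shift) / scale

def log_prob(x, scale=2.0, shift=1.0):
    """Exact log-likelihood of x under a standard-normal base distribution."""
    z = inverse(x, scale, shift)
    log_base = -0.5 * (z * z + math.log(2 * math.pi))  # log N(z; 0, 1)
    log_det = -math.log(abs(scale))                    # log |dz/dx|
    return log_base + log_det
```

This exact, tractable likelihood is the property the summary refers to: unlike diffusion models, a flow can directly score how probable a video is under the model.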

What's the problem?

Currently, the best video generation systems almost always rely on diffusion models. While effective, these models can struggle to maintain consistency over time and are computationally expensive. Normalizing flows offer advantages like directly calculating the probability of a video and learning everything in one end-to-end pass. However, they haven't matched the quality of diffusion models for video, because video data is complex and sequential, and small errors can build up as each new frame is generated.

What's the solution?

The researchers developed STARFlow-V, which uses an architecture that separates global motion from detailed, local changes within each frame. This helps prevent errors from accumulating as the video is generated frame by frame. They also introduced a technique called 'flow-score matching', which acts like a lightweight editor that keeps the video consistent over time. Finally, they sped up generation with a Jacobi iteration scheme, which lets many parts of each sampling step run in parallel without breaking the causal order of the frames. The same model can also create videos from text, images, or other videos.
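The parallel-sampling idea above can be illustrated with a toy sketch of Jacobi iteration (this is the generic fixed-point technique, not the paper's actual video-aware scheme; `g` and the loop structure are illustrative assumptions). A causal chain x_t = g(x_{t-1}) is normally computed one step at a time, but the same values can be reached by updating every position in parallel and sweeping repeatedly until the chain converges:

```python
# Toy illustration of Jacobi iteration for a causal recurrence x_t = g(x_{t-1}).

def sequential(g, x0, n):
    """Baseline: compute the chain one step at a time."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs

def jacobi(g, x0, n, iters):
    """Parallel alternative: guess all positions, then sweep.

    Each sweep updates every position simultaneously using the previous
    sweep's values; after at most n sweeps the chain matches the
    sequential result, because each sweep finalizes one more position.
    """
    xs = [x0] * (n + 1)
    for _ in range(iters):
        xs = [x0] + [g(xs[t]) for t in range(n)]
    return xs
```

In practice the payoff comes when the per-sweep updates are cheap to batch on an accelerator and the iteration converges in far fewer than n sweeps.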

Why it matters?

This work is important because it demonstrates that normalizing flows are a viable option for high-quality video generation, something that hasn't been proven before. It opens up a new research direction for building 'world models' – AI systems that can understand and generate realistic simulations of the world around us. It also provides a potential alternative to diffusion models, offering benefits like faster generation and a better understanding of the generated content.

Abstract

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.