LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
2025-08-06
Summary
This paper introduces LongVie, a new AI system designed to generate very long videos that stay visually consistent and sharp throughout, while letting users control the video content with several types of guidance signals.
What's the problem?
Creating long videos with AI is hard: fine details tend to flicker or drift between frames, and visual quality steadily degrades over time, especially when only a single type of control signal guides the generation.
What's the solution?
LongVie solves this by starting every clip from a unified noise initialization for consistency, normalizing control signals globally so they stay aligned across the whole video, and combining multiple control modalities, such as depth maps and keypoints, to guide generation, while training the model with degradation awareness so it can compensate for visual quality loss over time (see the sketch below).
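To make the global-normalization idea concrete, here is a minimal Python sketch (NumPy only; the function names and shapes are illustrative assumptions, not LongVie's actual code). It contrasts normalizing depth-map control signals with statistics from the entire video against normalizing each clip independently, which would shift the value range at clip boundaries:

```python
import numpy as np

def normalize_controls_globally(depth_maps: np.ndarray) -> np.ndarray:
    """Scale depth control signals using statistics of the WHOLE video.

    depth_maps: array of shape (num_frames, H, W). Using global min/max
    keeps every clip's control values on the same scale, so clips stay
    aligned with one another.
    """
    d_min, d_max = depth_maps.min(), depth_maps.max()
    return (depth_maps - d_min) / (d_max - d_min + 1e-8)

def normalize_per_clip(depth_maps: np.ndarray, clip_len: int) -> np.ndarray:
    """Per-clip normalization (the failure mode global normalization avoids):
    each clip gets its own scale, causing visible jumps between clips."""
    out = depth_maps.astype(np.float32).copy()
    for start in range(0, len(out), clip_len):
        clip = out[start:start + clip_len]
        c_min, c_max = clip.min(), clip.max()
        out[start:start + clip_len] = (clip - c_min) / (c_max - c_min + 1e-8)
    return out
```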
Why does it matter?
This matters because it enables long, high-quality, controllable video generation, which is useful for video editing, animation, virtual reality, and other creative or practical applications that need stable, realistic long videos.
Abstract
LongVie, an end-to-end autoregressive framework, addresses temporal consistency and visual degradation in ultra-long video generation through unified noise initialization, global control signal normalization, multi-modal control, and degradation-aware training.
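As a rough illustration of how unified noise initialization could fit into an autoregressive clip-by-clip loop, here is a hedged Python/PyTorch sketch. One reading of "unified noise initialization" is sampling the initial noise once and reusing it for every clip; the `model(noise, control, prev_frame)` interface and all names here are hypothetical stand-ins, not the paper's actual API:

```python
import torch

def generate_long_video(model, controls, frames_per_clip,
                        shape=(3, 64, 64), seed=42):
    """Autoregressive clip-by-clip generation (illustrative sketch).

    The initial noise is sampled ONCE and shared by every clip, rather
    than drawn fresh per clip, so all clips start from the same latent
    state; each clip is also conditioned on the last frame of the
    previous clip so content carries over.
    """
    torch.manual_seed(seed)
    base_noise = torch.randn(frames_per_clip, *shape)  # shared across clips
    video, prev_frame = [], None
    for control in controls:  # one control tensor per clip
        clip = model(base_noise, control, prev_frame)
        prev_frame = clip[-1]  # condition the next clip on the last frame
        video.append(clip)
    return torch.cat(video, dim=0)
```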