LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
2025-08-06
Summary
This paper introduces LongVie, a new AI system designed to generate very long videos that stay visually consistent and sharp throughout, while letting users control the video content with several types of guidance signals.
What's the problem?
Creating long videos with AI is hard: fine details tend to flicker or drift between frames, and visual quality steadily degrades over time, especially when only a single type of control signal guides the generation.
What's the solution?
LongVie solves this by starting every clip from a unified noise initialization for consistency, normalizing control signals globally so they stay aligned across the whole video, and combining multiple control modalities, such as depth maps and keypoints, to guide generation, while training the model with degradation awareness so it can compensate for visual quality loss over time (see the sketch below).
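To make the global-normalization idea concrete, here is a minimal Python sketch (NumPy only; the function names and shapes are illustrative assumptions, not LongVie's actual code). It contrasts normalizing depth-map control signals with statistics from the entire video against normalizing each clip independently, which would shift the value range at clip boundaries:

```python
import numpy as np

def normalize_controls_globally(depth_maps: np.ndarray) -> np.ndarray:
    """Scale depth control signals using statistics of the WHOLE video.

    depth_maps: array of shape (num_frames, H, W). Using global min/max
    keeps every clip's control values on the same scale, so clips stay
    aligned with one another.
    """
    d_min, d_max = depth_maps.min(), depth_maps.max()
    return (depth_maps - d_min) / (d_max - d_min + 1e-8)

def normalize_per_clip(depth_maps: np.ndarray, clip_len: int) -> np.ndarray:
    """Per-clip normalization (the failure mode global normalization avoids):
    each clip gets its own scale, causing visible jumps between clips."""
    out = depth_maps.astype(np.float32).copy()
    for start in range(0, len(out), clip_len):
        clip = out[start:start + clip_len]
        c_min, c_max = clip.min(), clip.max()
        out[start:start + clip_len] = (clip - c_min) / (c_max - c_min + 1e-8)
    return out
```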
Why does it matter?
This matters because it enables long, high-quality, controllable video generation, which is useful for video editing, animation, virtual reality, and other creative or practical applications that need stable, realistic long videos.
Abstract
LongVie, an end-to-end autoregressive framework, addresses temporal consistency and visual degradation in ultra-long video generation through unified noise initialization, global control signal normalization, multi-modal control, and degradation-aware training.
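As a rough illustration of how unified noise initialization could fit into an autoregressive clip-by-clip loop, here is a hedged Python/PyTorch sketch. One reading of "unified noise initialization" is sampling the initial noise once and reusing it for every clip; the `model(noise, control, prev_frame)` interface and all names here are hypothetical stand-ins, not the paper's actual API:

```python
import torch

def generate_long_video(model, controls, frames_per_clip,
                        shape=(3, 64, 64), seed=42):
    """Autoregressive clip-by-clip generation (illustrative sketch).

    The initial noise is sampled ONCE and shared by every clip, rather
    than drawn fresh per clip, so all clips start from the same latent
    state; each clip is also conditioned on the last frame of the
    previous clip so content carries over.
    """
    torch.manual_seed(seed)
    base_noise = torch.randn(frames_per_clip, *shape)  # shared across clips
    video, prev_frame = [], None
    for control in controls:  # one control tensor per clip
        clip = model(base_noise, control, prev_frame)
        prev_frame = clip[-1]  # condition the next clip on the last frame
        video.append(clip)
    return torch.cat(video, dim=0)
```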