LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu

2025-12-16

Summary

This paper introduces LongVie 2, a system for generating realistic, controllable videos that run for minutes rather than seconds, along with LongVGenBench, a benchmark for evaluating such systems.

What's the problem?

Building a 'world model' for video, meaning a system that can understand and realistically generate video sequences, is genuinely hard. Such a model needs to follow instructions (controllability), stay sharp and artifact-free over long stretches (visual quality), and flow smoothly from one moment to the next (temporal consistency). Existing systems often struggle to deliver all three at once, especially when generating videos longer than a few seconds.

What's the solution?

The researchers built LongVie 2 step by step, in three training stages. First, they made the system better at following instructions by combining several types of control signals, both dense and sparse. Second, they trained it on deliberately degraded input frames, so the quality drop that normally creeps into long generations does not derail it. Finally, they aligned the context shared by adjacent clips so that each part of the video connects logically to the parts before and after it. They also created LongVGenBench, a set of 100 long, high-quality videos for testing how well their system, and others, perform. A rough sketch of the resulting clip-by-clip generation loop is given below.
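To make the three stages concrete, here is a minimal, hypothetical sketch of what the clip-by-clip generation loop could look like. None of this is the authors' actual code: the model interface (`generate`, `encode_context`), the tensor shapes, and the dummy model are all illustrative assumptions based only on the abstract.

```python
import torch


class DummyWorldModel:
    """Stand-in for a LongVie 2-style model; the interface is an assumption."""

    def generate(self, cond_frame, dense_control, sparse_control, history_context):
        # Return a random clip of shape (frames, C, H, W), just for illustration.
        return torch.rand(dense_control.shape[0], *cond_frame.shape)

    def encode_context(self, clip):
        # Toy "history" feature: the mean frame of the clip.
        return clip.mean(dim=0)


def generate_long_video(model, dense_controls, sparse_controls,
                        first_frame, num_clips, frames_per_clip=16):
    """Generate a long video clip by clip, conditioning each clip on the
    previous one so the result stays temporally consistent."""
    clips = []
    cond_frame = first_frame   # last frame handed over from the previous clip
    history = None             # context features from earlier clips

    for i in range(num_clips):
        start = i * frames_per_clip
        # (1) Multi-modal guidance: dense and sparse control signals
        # for the frames of this clip.
        dense = dense_controls[start:start + frames_per_clip]
        sparse = sparse_controls[start:start + frames_per_clip]

        # (2) At training time the paper degrades this conditioning frame so
        # the model tolerates the quality drift of long rollouts; at inference
        # the possibly imperfect generated frame is passed through as-is.
        clip = model.generate(
            cond_frame=cond_frame,
            dense_control=dense,
            sparse_control=sparse,
            history_context=history,  # (3) history-context guidance
        )

        clips.append(clip)
        cond_frame = clip[-1]                 # hand off the last frame
        history = model.encode_context(clip)  # refresh the history context

    return torch.cat(clips, dim=0)


# Tiny usage example with random tensors standing in for real data.
model = DummyWorldModel()
dense = torch.rand(64, 1, 64, 64)    # e.g. one dense control map per frame
sparse = torch.rand(64, 1, 64, 64)   # e.g. one sparse control map per frame
video = generate_long_video(model, dense, sparse,
                            first_frame=torch.rand(3, 64, 64), num_clips=4)
print(video.shape)  # torch.Size([64, 3, 64, 64])
```

The key structural point is the hand-off: each clip is conditioned on the last frame and the history context of the clip before it, which is what lets the loop run for minutes without losing the thread.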

Why it matters?

This work is important because it represents a significant advance in creating AI that can understand and generate realistic video. Being able to create long, controllable, and high-quality videos is a key step towards building more generally intelligent systems that can interact with and understand the visual world around us, potentially impacting fields like robotics, virtual reality, and content creation.

Abstract

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
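Stage (2), degradation-aware training, is the least standard of the three, so here is a brief sketch of what degrading the conditioning frame at training time might look like. The specific degradations (Gaussian noise plus a box blur) and all parameter names are assumptions for illustration, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F


def degrade_conditioning_frame(frame, noise_std=0.05, blur_kernel=5, p=0.5):
    """Randomly degrade a conditioning frame of shape (C, H, W) in [0, 1],
    so the model learns to start from the kind of imperfect frames it will
    actually see during long autoregressive rollouts."""
    if torch.rand(()) < p:
        # Additive Gaussian noise, mimicking accumulated generation noise.
        frame = frame + noise_std * torch.randn_like(frame)
    if torch.rand(()) < p:
        # Depthwise box blur as a stand-in for resampling/compression blur.
        c = frame.shape[0]
        kernel = torch.ones(c, 1, blur_kernel, blur_kernel) / blur_kernel ** 2
        frame = F.conv2d(frame.unsqueeze(0), kernel,
                         padding=blur_kernel // 2, groups=c).squeeze(0)
    return frame.clamp(0.0, 1.0)


# Usage: degrade a random RGB frame before feeding it as conditioning input.
degraded = degrade_conditioning_frame(torch.rand(3, 64, 64))
```

Whatever the exact degradations, the idea is the same: by conditioning on imperfect frames during training, the model stops assuming a pristine input, which is what "bridging the gap between training and long-term inference" refers to.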