DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
Runze Zhang, Guoguang Du, Xiaochuan Li, Qi Jia, Liang Jin, Lu Liu, Jingjing Wang, Cong Xu, Zhenhua Guo, Yaqian Zhao, Xiaoli Gong, Rengang Li, Baoyu Fan
2025-03-18
Summary
This paper introduces DropletVideo, a dataset and model designed to improve spatio-temporal consistency in AI-generated videos, with a focus on how camera movements and plot developments interact.
What's the problem?
Current AI models struggle to generate videos in which objects and scenes remain consistent across changing viewpoints and the story stays coherent under complex camera movements. Existing approaches typically target either spatial or temporal consistency in isolation, not the interplay between the two.
What's the solution?
The researchers built DropletVideo-10M, a dataset of 10 million videos featuring dynamic camera motion and object actions, each annotated with a detailed caption averaging 206 words. They then trained the DropletVideo model on this dataset; the model excels at preserving spatio-temporal coherence during video generation.
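To make the dataset's structure concrete, here is a minimal sketch of what one annotation record might look like and how it could be loaded. The field names (`video_path`, `caption`, `camera_motion`, `object_actions`), the JSON-lines file format, and the filename are illustrative assumptions, not the dataset's published schema; see https://dropletx.github.io for the actual release format.

```python
# Hypothetical sketch of a DropletVideo-10M annotation record.
# Field names and file format are assumptions for illustration;
# consult https://dropletx.github.io for the actual schema.
import json
from dataclasses import dataclass

@dataclass
class DropletVideoRecord:
    video_path: str            # path to the source clip
    caption: str               # dense caption (~206 words on average)
    camera_motion: list[str]   # e.g. ["pan_left", "zoom_in"]
    object_actions: list[str]  # e.g. ["dog enters frame", "car exits"]

def load_records(jsonl_path: str) -> list[DropletVideoRecord]:
    """Load annotation records from a JSON-lines file (assumed layout)."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            records.append(DropletVideoRecord(**json.loads(line)))
    return records

# Example: keep only clips whose annotations include camera movement,
# the kind of sample the paper emphasizes for spatio-temporal training.
records = load_records("dropletvideo_annotations.jsonl")
dynamic = [r for r in records if r.camera_motion]
```

Whatever the real schema, the key point is that each caption jointly describes camera movement and plot development, so a model trained on it sees supervision for their interaction rather than for either dimension alone.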
Why it matters?
This work addresses a key challenge in video generation, moving toward AI models that can produce more realistic and coherent videos with complex camera movements and evolving plots.
Abstract
Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or on their basic combination, such as appending a camera-movement description to a prompt without constraining the outcomes of that movement. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby overlaying and affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, as well as the long-term impact of prior content on subsequent generation. Our work spans dataset construction through model development. We first constructed the DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with a caption averaging 206 words that details camera movements and plot developments. We then developed and trained the DropletVideo model, which excels at preserving spatio-temporal coherence during video generation. The DropletVideo dataset and model are accessible at https://dropletx.github.io.