VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim

2024-12-04

Summary

This paper presents VideoGen-of-Thought (VGoT), a new framework designed to generate multi-shot videos that tell coherent stories, similar to movies.

What's the problem?

Current video generation models are good at creating short clips but struggle with longer, multi-shot videos that require a logical storyline and visual consistency. Because most existing models are trained with a single-shot objective, they have difficulty connecting different scenes smoothly and maintaining a coherent narrative throughout a longer video.

What's the solution?

VGoT addresses this problem by breaking down the video generation process into several structured steps. First, it generates a script that outlines the story. Then, it creates keyframes that visually represent important moments in the story. After that, it generates the actual video shots based on the script and keyframes. Finally, it includes a smoothing mechanism to ensure that transitions between shots are seamless. This structured approach helps maintain narrative flow and visual consistency across multiple shots.
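The overall flow can be pictured as a simple four-stage pipeline. The sketch below is illustrative only: the stage functions (script_fn, keyframe_fn, shot_fn, smooth_fn) are hypothetical placeholders for whatever models fill each role, not VGoT's actual API.

```python
from typing import Any, Callable, List

def generate_multi_shot_video(
    story: str,
    num_shots: int,
    script_fn: Callable[[str, int], List[str]],   # story -> per-shot prompts
    keyframe_fn: Callable[[str], Any],            # prompt -> keyframe image
    shot_fn: Callable[[str, Any], Any],           # (prompt, keyframe) -> video shot
    smooth_fn: Callable[[List[Any]], List[Any]],  # shots -> smoothed shots
) -> List[Any]:
    # 1. Script generation: expand the short story into detailed shot prompts.
    prompts = script_fn(story, num_shots)
    # 2. Keyframe generation: one visually consistent keyframe per shot.
    keyframes = [keyframe_fn(p) for p in prompts]
    # 3. Shot-level video generation from each prompt and its keyframe.
    shots = [shot_fn(p, k) for p, k in zip(prompts, keyframes)]
    # 4. Smoothing: blend adjacent shots so transitions are seamless.
    return smooth_fn(shots)
```

Because each stage only consumes the previous stage's output, any single-shot generator can slot into stage 3 while the surrounding stages supply the narrative structure it lacks on its own.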

Why it matters?

This research is important because it enhances the ability of AI to create high-quality, movie-like videos that are coherent and engaging. By improving how multi-shot videos are generated, VGoT can be applied in various fields such as filmmaking, advertising, and education, where storytelling is crucial. This could lead to more advanced video content creation tools that allow users to produce professional-quality videos with minimal effort.

Abstract

Current video generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos. Existing models, though trained on large-scale data with rich computational resources, are unsurprisingly inadequate for maintaining a logical storyline and visual consistency across the multiple shots of a cohesive script, since they are often trained with a single-shot objective. To this end, we propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. VGoT is designed with three goals in mind. Multi-Shot Video Generation: We divide the video generation process into a structured, modular sequence comprising (1) Script Generation, which translates a brief story into detailed prompts for each shot; (2) Keyframe Generation, responsible for creating visually consistent keyframes faithful to character portrayals; (3) Shot-Level Video Generation, which transforms information from scripts and keyframes into shots; and (4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable Narrative Design: Inspired by cinematic scriptwriting, our prompt generation approach spans five key domains, ensuring logical consistency, character development, and narrative flow across the entire video. Cross-Shot Consistency: We ensure temporal and identity consistency by leveraging identity-preserving (IP) embeddings across shots, which are automatically created from the narrative. Additionally, we incorporate a cross-shot smoothing mechanism that integrates a reset boundary to combine latent features from adjacent shots, yielding smooth transitions and maintaining visual coherence throughout the video. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
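To make the identity-preserving (IP) conditioning concrete, here is a minimal sketch of how one might build a single reusable identity embedding from character portraits and hand it to every shot. The encoder and the average pooling are assumptions chosen for illustration, not the paper's implementation.

```python
import numpy as np

def build_ip_embedding(portraits: list, encode) -> np.ndarray:
    """Pool per-portrait identity features into one embedding that every
    shot can condition on, keeping a character's look consistent.
    `encode` is an assumed image-to-vector identity encoder (hypothetical)."""
    feats = np.stack([encode(img) for img in portraits])
    emb = feats.mean(axis=0)          # simple average pooling across portraits
    return emb / np.linalg.norm(emb)  # unit-normalize for stable conditioning
```

Each shot's keyframe generator then receives the same embedding, so the character looks alike across shots even though the shots are generated independently.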
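The cross-shot smoothing can likewise be pictured as a crossfade in latent space. This is only one plausible reading of the "reset boundary" described above: latent frames inside a small boundary window are blended between adjacent shots, while frames outside the window are left untouched.

```python
import numpy as np

def smooth_shot_boundary(prev_latents: np.ndarray,
                         next_latents: np.ndarray,
                         window: int = 4) -> np.ndarray:
    """Blend the tail of one shot's latents into the head of the next.
    Latents are assumed to be float arrays of shape (frames, ...); only the
    first `window` frames of the incoming shot are modified ("reset" region)."""
    out = next_latents.copy()
    for i in range(window):
        w = (i + 1) / (window + 1)  # weight ramps toward the incoming shot
        out[i] = (1 - w) * prev_latents[-window + i] + w * next_latents[i]
    return out
```

Blending in latent space rather than pixel space lets the decoder resolve small mismatches between shots, which is why such transitions tend to look smoother than a pixel-level crossfade.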