DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
2024-11-26
Summary
This paper introduces DreamRunner, a new method for generating storytelling videos from a text script, handling complex motions and transitions across multiple scenes.
What's the problem?
Creating storytelling videos involves several challenges: objects must move in realistic, fine-grained ways; characters must remain consistent across scenes; and multiple actions must transition smoothly within a single scene. Existing video generation models often struggle with these demands and fail to represent the story accurately and fluidly.
What's the solution?
DreamRunner addresses these challenges in three stages. First, it uses a large language model to break the input script down into scenes, along with object-level layouts and detailed motions. Next, it applies retrieval-augmented test-time adaptation, customizing object motions by fine-tuning on reference videos retrieved for each target motion. Finally, it incorporates a spatial-temporal region-based 3D attention and prior injection module (SR3AI) that binds objects to their motions and controls semantics frame by frame. This combination allows DreamRunner to produce coherent, visually engaging videos that follow the story closely.
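To make the pipeline concrete, here is a minimal sketch of the first two stages: script-to-scene planning and motion-based retrieval. All function names, data structures, and the keyword-overlap retrieval heuristic are illustrative assumptions for this sketch, not the paper's actual implementation (which uses an LLM planner and video retrieval followed by test-time fine-tuning).

```python
# Hypothetical sketch of DreamRunner's planning and retrieval stages.
# Every name here is illustrative, not the paper's API.
from dataclasses import dataclass

@dataclass
class ScenePlan:
    description: str   # coarse scene text (the LLM planner's output in the paper)
    motions: list      # fine-grained per-object motion phrases

def plan_script(script: str) -> list:
    """Stage 1 stand-in for the LLM planner: split the script into scenes
    and extract a motion phrase per sentence with a toy heuristic."""
    scenes = []
    for sentence in filter(None, (s.strip() for s in script.split("."))):
        words = sentence.split()
        # toy heuristic: everything after the first word is the "motion"
        motions = [" ".join(words[1:])] if len(words) > 1 else []
        scenes.append(ScenePlan(description=sentence, motions=motions))
    return scenes

def retrieve_motion_videos(motion: str, library: dict) -> list:
    """Stage 2 stand-in: the paper adapts the model on retrieved videos
    matching the target motion; here we just rank by keyword overlap."""
    query = set(motion.lower().split())
    scored = [(len(query & tags), vid) for vid, tags in library.items()]
    return [vid for score, vid in sorted(scored, reverse=True) if score > 0]

# Toy motion library: video id -> tag set (hypothetical data)
LIBRARY = {
    "vid_gallop": {"horse", "gallops", "field"},
    "vid_jump":   {"cat", "jumps", "fence"},
}

plans = plan_script("A horse gallops across the field. A cat jumps over the fence.")
refs = [retrieve_motion_videos(p.motions[0], LIBRARY) for p in plans]
```

In the actual method, the retrieved videos would then drive test-time adaptation of the video model to capture each motion prior, and SR3AI would inject those priors region by region during generation.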
Why it matters?
This research is significant because it improves the quality of automatically generated storytelling videos, which is valuable for the media and entertainment industries. By generating videos that follow scripts more faithfully, DreamRunner can help filmmakers and content creators produce more dynamic and engaging narratives, making video storytelling more accessible.
Abstract
Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.