MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia
2025-12-03
Summary
This paper introduces MultiShotMaster, a system designed to create longer, more complex videos composed of multiple shots, going beyond what current single-clip video generators can do.
What's the problem?
Existing AI video generators are really good at making short, single-shot clips, but they struggle when you want a video whose story unfolds over multiple shots or scenes. It's hard to control the order of the shots, keep the story coherent throughout, and give specific instructions beyond a text description. On top of that, there isn't much training data available for these kinds of multi-shot videos.
What's the solution?
The researchers built MultiShotMaster by starting with a video generator that already makes good single clips and adding two key improvements. First, they changed how the system encodes time across shots (a modified rotary position embedding, or RoPE), allowing shots to be arranged flexibly while the story still flows in order. Second, they added a way to inject specific visual references, such as subject images or signals about where things should appear in the scene, giving finer control over each shot. To get enough training data, they also built a pipeline that automatically annotates existing videos with the information needed for multi-shot generation.
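The paper does not give implementation details here, but the first idea, a temporal position encoding with an explicit jump at each shot boundary, can be sketched roughly as follows. The function names, the fixed `phase_shift` value, and the tiny 1-D rotary helper are all illustrative assumptions, not the authors' actual code.

```python
import math

def narrative_positions(shot_lengths, phase_shift):
    """Hypothetical sketch: assign a temporal position id to every frame,
    adding a fixed phase offset at each shot transition so shots are
    separated in position space while the overall order is preserved."""
    positions = []
    offset = 0.0
    t = 0
    for n in shot_lengths:
        for _ in range(n):
            positions.append(t + offset)
            t += 1
        offset += phase_shift  # explicit jump between consecutive shots
    return positions

def rope_rotate(x, pos, base=10000.0):
    """Apply a standard 1-D rotary embedding to vector x at position pos."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out
```

With `shot_lengths=[3, 2]` and `phase_shift=100.0`, the frames get positions `[0, 1, 2, 103, 104]`: within each shot the spacing is the familiar one-step-per-frame, but the large gap at the boundary lets attention distinguish a cut from ordinary motion.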
Why does it matter?
This work is important because it moves us closer to AI systems that can create full-fledged videos with narratives, not just short clips. It gives users more control over the content, allowing them to customize characters, actions, and scenes, and it opens the door for creating more engaging and complex video content using AI.
Abstract
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
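The second variant, Spatiotemporal Position-Aware RoPE, gives reference tokens position ids tied to where (and when) the referenced subject should appear. The abstract does not specify the scheme, so the following is a purely illustrative sketch under assumed conventions: a normalized grounding box, a target frame index, and a square layout of reference tokens, none of which come from the paper.

```python
def grounded_ref_positions(num_tokens, box, t_idx, h, w):
    """Hypothetical sketch: give each reference token a (t, y, x)
    position inside a grounding box at frame t_idx, so rotary
    attention treats the reference as located in that region.

    box: (x0, y0, x1, y1) normalized corners in [0, 1]
    h, w: spatial size of the video latent grid
    """
    x0, y0, x1, y1 = box
    side = max(int(num_tokens ** 0.5), 1)  # lay tokens out on a square grid
    pos = []
    for i in range(num_tokens):
        r, c = divmod(i, side)
        y = (y0 + (y1 - y0) * r / max(side - 1, 1)) * (h - 1)
        x = (x0 + (x1 - x0) * c / max(side - 1, 1)) * (w - 1)
        pos.append((t_idx, y, x))
    return pos
```

Video latent tokens would keep their usual (t, y, x) grid positions; appending reference tokens whose positions reuse the grounded region's coordinates is one way to realize "spatiotemporal-grounded reference injection" without any extra conditioning branch.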