ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
Songlin Yang, Zhe Wang, Xuyi Yang, Songchun Zhang, Xianghao Kong, Taiyi Wu, Xiaotong Zhao, Ran Zhang, Alan Zhao, Anyi Rao
2026-03-13
Summary
This paper introduces a new way to create videos from text descriptions, specifically focusing on making those videos look more cinematic with multiple camera shots.
What's the problem?
Currently, making videos from text is getting easier, but controlling the camera movements to create a professional, movie-like feel is really hard. Simply writing what you want isn't precise enough, and manually planning every camera angle is time-consuming and often doesn't even work well with existing video generation models.
What's the solution?
The researchers propose a system called ShotVerse that breaks the process into two parts: a 'Planner' that uses a vision-language model to figure out good camera paths based on the text, and a 'Controller' that actually creates the video following those paths. A key part of this is creating a large dataset of videos with precise camera information, which they did using a new automated system to align different shots together. This dataset, called ShotVerse-Bench, helps train and test their system.
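To picture how the two parts fit together, here is a minimal Python sketch of a "Plan-then-Control" pipeline built around (Caption, Trajectory, Video)-style shot records. The class names, pose format, and placeholder logic are illustrative assumptions for exposition only, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Shot:
    """One shot: a caption plus a camera trajectory as per-frame 4x4 camera-to-world poses."""
    caption: str
    trajectory: np.ndarray  # shape (num_frames, 4, 4)


class Planner:
    """Hypothetical stand-in for the VLM-based Planner: text -> per-shot captions and trajectories."""

    def plan(self, script: str) -> List[Shot]:
        # A real planner would query a vision-language model; here we return a single
        # placeholder dolly-in shot just to show the data flow.
        poses = np.stack([np.eye(4) for _ in range(16)])
        poses[:, 2, 3] = np.linspace(0.0, -1.0, 16)  # push the camera forward along -z
        return [Shot(caption=script, trajectory=poses)]


class Controller:
    """Hypothetical stand-in for the camera-adapter Controller: shots -> rendered frames."""

    def render(self, shots: List[Shot]) -> List[np.ndarray]:
        # A real controller would condition a video generator on each trajectory;
        # here we emit blank frames of the right shape as placeholders.
        return [np.zeros((len(s.trajectory), 256, 256, 3), dtype=np.uint8) for s in shots]


def generate(script: str) -> List[np.ndarray]:
    """Plan-then-Control: the Planner proposes trajectories, the Controller executes them."""
    shots = Planner().plan(script)
    return Controller().render(shots)
```

The point of the split is visible even in this toy version: the Planner owns the creative decision of where the camera goes, while the Controller only has to execute a fully specified trajectory, so neither side needs vague text to carry precise geometric intent.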
Why it matters?
This work is important because it helps bridge the gap between easily writing a description for a video and actually getting a high-quality, cinematic result. It makes creating multi-shot videos more reliable and less reliant on either vague instructions or a ton of manual work, ultimately making video creation more accessible and professional-looking.
Abstract
Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
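The abstract does not spell out how the calibration pipeline aligns disjoint single-shot trajectories, so the sketch below only illustrates the general idea under an assumed setup: each shot's camera poses live in their own local frame, and corresponding 3D anchor points reconstructed in both a shot's frame and the global frame let us estimate a rigid transform (via the Kabsch algorithm) that maps the whole shot into the global coordinate system. A full pipeline would also have to resolve per-shot scale (a similarity rather than rigid transform); the function names and the anchor-point assumption are hypothetical.

```python
import numpy as np


def kabsch_rigid_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Estimate the 4x4 rigid transform mapping src points (m, 3) onto dst points (m, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = dst_c - R @ src_c
    return T


def align_shot_to_global(shot_poses: np.ndarray,
                         shot_anchor_pts: np.ndarray,
                         global_anchor_pts: np.ndarray) -> np.ndarray:
    """Map a shot's (n, 4, 4) camera-to-world poses into the global frame via anchor correspondences."""
    T = kabsch_rigid_transform(shot_anchor_pts, global_anchor_pts)
    return np.einsum('ij,njk->nik', T, shot_poses)  # left-multiply every pose by T
```

Once every shot's trajectory has been mapped into the same global frame in this fashion, the per-shot (Caption, Trajectory, Video) records can be stitched into the globally aligned triplets the abstract describes.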