CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai

2025-02-13

Summary

This paper introduces CineMaster, an AI system that generates videos from text descriptions while letting users control the 3D aspects of the scene, such as where objects are placed and how the camera moves, much as a film director would.

What's the problem?

Current AI systems that make videos from text descriptions don't give users much control over the 3D aspects of the scene, like where things are placed or how the camera moves. This makes it hard for people to create exactly the video they want, especially if they need specific camera angles or object placements.

What's the solution?

The researchers created CineMaster, which works in two stages. First, an interactive workflow lets users position 3D bounding boxes for objects and plan camera movements, much like setting up a virtual movie set. Then the system renders these choices into control signals (depth maps, camera trajectories, and object class labels) and feeds them, along with the text description, to a video diffusion model so the generated video matches the intended layout. Because few real-world videos come with 3D motion and camera annotations, the researchers also built an automated pipeline that labels large-scale video data with 3D bounding boxes and camera trajectories, giving the model the training data it needs to understand 3D scenes.
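To make the two-stage flow concrete, here is a minimal sketch of what the stage-one output might look like as data. All names here (`BoundingBox3D`, `CameraPose`, `build_control_signals`) are hypothetical illustrations, not from the CineMaster codebase; a real implementation would render an actual depth map per frame rather than the placeholder strings used below.

```python
from dataclasses import dataclass
from typing import List, Tuple, Dict

# Hypothetical types for the user's 3D scene setup (stage one).
@dataclass
class BoundingBox3D:
    label: str                          # object class label, e.g. "car"
    center: Tuple[float, float, float]  # position in the 3D scene
    size: Tuple[float, float, float]    # box dimensions

@dataclass
class CameraPose:
    position: Tuple[float, float, float]
    look_at: Tuple[float, float, float]

def build_control_signals(boxes: List[BoundingBox3D],
                          trajectory: List[CameraPose]) -> Dict:
    """Turn the user's 3D layout into per-frame conditioning signals.

    In the actual system, each frame's depth map would be rendered from
    the boxes as seen by that frame's camera; here we only record the
    inputs such a renderer would consume.
    """
    return {
        "class_labels": [b.label for b in boxes],
        "camera_trajectory": [(p.position, p.look_at) for p in trajectory],
        "depth_maps": [f"<depth render for frame {i}>"
                       for i in range(len(trajectory))],
    }

# Stage two would pass these signals, plus the text prompt, to the
# text-to-video diffusion model as guidance.
signals = build_control_signals(
    [BoundingBox3D("car", (0.0, 0.0, 5.0), (2.0, 1.5, 4.0))],
    [CameraPose((0.0, 1.0, 0.0), (0.0, 0.0, 5.0)),
     CameraPose((0.5, 1.0, 0.0), (0.0, 0.0, 5.0))],
)
print(signals["class_labels"])      # one label per placed object
print(len(signals["depth_maps"]))   # one depth map per camera frame
```

The key design point the sketch illustrates is that every frame of the camera trajectory yields its own depth map, so the diffusion model receives per-frame 3D guidance rather than a single static layout.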

Why it matters?

This matters because it could change how people create videos using AI. Instead of just getting a random video based on a text description, users could have much more control over the final product, almost like being a movie director. This could be useful for filmmakers, advertisers, or anyone who needs to create specific video content without expensive equipment or large production teams.

Abstract

In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with comparable controllability as professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, ensuring to generate the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and implements prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.