GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, Jun Gao

2025-03-06

Summary

This paper introduces GEN3C, a new AI system that creates realistic videos with precise camera movements and consistent 3D scenes.

What's the problem?

Current AI video generators often make mistakes, like having objects suddenly appear or disappear, and they struggle to control camera movement accurately. This is because they don't really understand the 3D structure of the scene they're creating.

What's the solution?

The researchers created GEN3C, which uses a '3D cache' - like a 3D map of the scene - to guide video creation. This cache is built from depth information predicted for the initial images or for previously generated frames. When making new frames, GEN3C renders this 3D information from the desired camera viewpoint and uses the result as guidance, producing a consistent, realistic video. This lets the AI focus on adding new details and moving the scene forward, rather than trying to remember what it created before.
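The core geometric step behind the 3D cache can be sketched in a few lines: lift a depth map to a point cloud, then project that cloud into a new camera pose to get a rendering the video model can be conditioned on. The sketch below is a simplified illustration with assumed pinhole-camera conventions, not the paper's implementation; GEN3C renders the colored point cloud with a neural video model on top, while here we only keep a depth buffer for brevity.

```python
import numpy as np

def unproject(depth, K):
    """Lift a per-pixel depth map (H, W) to a 3D point cloud in
    camera coordinates, using 3x3 pinhole intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix                             # back-projected rays
    return (rays * depth.reshape(-1)).T                       # (H*W, 3) points

def render_points(points, K, R, t, H, W):
    """Project a point cloud into a new camera (rotation R, translation t).
    Returns a depth buffer; pixels no point lands on stay at +inf -
    these are the 'holes' the generative model must fill in."""
    cam = R @ points.T + t.reshape(3, 1)                      # world -> new camera
    z = cam[2]
    valid = z > 1e-6                                          # keep points in front
    proj = K @ cam[:, valid]
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    zbuf = np.full((H, W), np.inf)
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # nearest point wins: z-buffering via an in-place minimum scatter
    np.minimum.at(zbuf, (v[inb], u[inb]), z[valid][inb])
    return zbuf
```

With the original camera pose (R = I, t = 0) the rendering reproduces the input depth exactly; as the user's camera trajectory moves away from observed views, +inf holes appear, and those are precisely the regions where the generator spends its capacity.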

Why it matters?

This matters because it could lead to much more realistic and controllable AI-generated videos, which would be useful for making movies, video games, and virtual reality experiences. The ability to create consistent 3D scenes with precise camera control could also help in fields like architectural visualization and training simulations for autonomous vehicles.

Abstract

We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage! https://research.nvidia.com/labs/toronto-ai/GEN3C/