VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
2025-02-12
Summary
This paper introduces VidCRAFT3, a new AI system that can turn a single image into a video while giving users control over three key aspects: how the camera moves, how objects in the scene move, and how the lighting changes. It's like having a virtual movie studio where you can direct the action from just one picture.
What's the problem?
Current AI systems that turn images into videos usually let users control only one or two things at a time, such as camera movement or object motion. They can't control multiple aspects at once, both because there isn't enough training data annotated with all of these signals and because their networks aren't designed to handle so many controls simultaneously.
What's the solution?
The researchers created VidCRAFT3, which uses a new AI design called the Spatial Triple-Attention Transformer to handle camera movement, object movement, and lighting direction all at once. They also built a new dataset of computer-generated videos, called VideoLightingDirection (VLD), that includes lighting-direction labels, something most real video datasets lack. This helps the AI learn how light affects different objects and scenes, including strong transmission and reflection effects. Finally, they used a three-stage training process to teach the AI to control all these elements without ever needing videos in which all three change at the same time.
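To make the "triple attention" idea concrete, here is a minimal sketch of the general pattern: the video's spatial tokens attend to three conditioning signals (image, text, and lighting direction) in parallel, and the three branches are fused symmetrically by summation. This is an illustrative toy, not the paper's actual architecture — the function names, projection-free attention, and tensor sizes are all assumptions for clarity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, dim):
    # Scaled dot-product attention: query tokens attend to one
    # conditioning signal. Learned projections are omitted for brevity.
    scores = q @ kv.T / np.sqrt(dim)
    return softmax(scores) @ kv

def spatial_triple_attention(x, img_emb, txt_emb, light_emb):
    # Hypothetical sketch: three parallel cross-attention branches
    # (image, text, lighting direction), each treated symmetrically,
    # fused by summing their outputs onto the video tokens.
    d = x.shape[-1]
    out = x.copy()
    for cond in (img_emb, txt_emb, light_emb):
        out = out + cross_attention(x, cond, d)
    return out

# Toy usage: 16 spatial tokens of dimension 32, with made-up
# conditioning token counts (4 image, 8 text, 1 lighting).
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
y = spatial_triple_attention(
    x,
    rng.standard_normal((4, 32)),   # image tokens (assumed)
    rng.standard_normal((8, 32)),   # text tokens (assumed)
    rng.standard_normal((1, 32)),   # lighting-direction token (assumed)
)
print(y.shape)  # (16, 32)
```

The point of the symmetric fusion is that no single conditioning signal is privileged, which is what lets control over the three elements stay decoupled.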
Why it matters?
This matters because it brings us closer to being able to create realistic videos from single images with a lot more control. It could be used in fields like movie making, video game design, or virtual reality to create more dynamic and realistic scenes without needing to film everything in real life. It also shows how AI can be taught to understand complex visual elements, which could help improve computer vision and graphics in many other applications.
Abstract
Recent image-to-video generation methods have demonstrated success in enabling control over one or two visual elements, such as camera trajectory or object motion. However, these methods are unable to offer control over multiple visual elements due to limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables control over camera motion, object motion, and lighting direction simultaneously. To better decouple control over each visual element, we propose the Spatial Triple-Attention Transformer, which integrates lighting direction, text, and image in a symmetric way. Since most real-world video datasets lack lighting annotations, we construct a high-quality synthetic video dataset, the VideoLightingDirection (VLD) dataset. This dataset includes lighting direction annotations and objects of diverse appearance, enabling VidCRAFT3 to effectively handle strong light transmission and reflection effects. Additionally, we propose a three-stage training strategy that eliminates the need for training data annotated with multiple visual elements (camera motion, object motion, and lighting direction) simultaneously. Extensive experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content, surpassing existing state-of-the-art methods in terms of control granularity and visual coherence. All code and data will be publicly available. Project page: https://sixiaozheng.github.io/VidCRAFT3/.