BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
2025-12-05
Summary
This paper introduces a new way to create videos with artificial intelligence that gives creators independent control over what happens in the scene and how the camera moves.
What's the problem?
Current AI video generators are really good at making videos look realistic, but they treat the action in the scene and the camera movement as one connected thing. This makes it hard to precisely control *both* what's happening *and* how you're looking at it – you can't easily change the timing of events or the camera angle independently.
What's the solution?
The researchers developed a system that separates how things move in the scene from how the camera moves. They feed the AI two separate control signals: the timing of events (what they call 'world-time') and the path the camera takes. These signals are injected into the model through positional encodings in its attention layers and through adaptive normalization of its features, which lets the model understand and respond to changes in either one. They also created a new dataset specifically for training this type of system, in which the timing and the camera movements are carefully controlled and can be varied separately.
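To make the two conditioning ingredients concrete, here is a minimal, dependency-free sketch: a sinusoidal embedding of a continuous 'world-time' value and camera coordinates, and an AdaLN-style feature modulation driven by that conditioning. The function names (`encode_4d`, `adaln_modulate`) and the toy scale/shift projection are illustrative assumptions, not the paper's actual implementation, where the scale and shift would come from a learned network.

```python
import math

def sinusoidal_embedding(value, dim):
    """Sinusoidal encoding of a continuous scalar (e.g. world-time)."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.append(math.sin(value * freq))
        emb.append(math.cos(value * freq))
    return emb

def encode_4d(world_time, camera_pose, dim_per_axis=8):
    """Hypothetical 4D conditioning vector: world-time plus the camera's
    (x, y, z) position, each axis embedded independently and concatenated,
    so either signal can change without disturbing the other."""
    parts = sinusoidal_embedding(world_time, dim_per_axis)
    for axis_value in camera_pose:
        parts += sinusoidal_embedding(axis_value, dim_per_axis)
    return parts

def adaln_modulate(features, cond, eps=1e-5):
    """AdaLN-style modulation: normalize the features, then scale and
    shift them using parameters derived from the conditioning vector."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    normed = [(f - mean) / math.sqrt(var + eps) for f in features]
    # In a real model the scale/shift come from a learned MLP over `cond`;
    # fixed toy projections keep this sketch self-contained.
    scale = 1.0 + 0.1 * sum(cond) / len(cond)
    shift = 0.01 * sum(cond)
    return [scale * f + shift for f in normed]

emb = encode_4d(world_time=0.5, camera_pose=(1.0, 0.0, 2.0))
print(len(emb))  # 4 axes * 8 dims per axis = 32
```

Because time and each camera coordinate get their own embedding slice, adjusting the world-time sequence leaves the camera portion of the conditioning untouched, which is the decoupling the summary describes.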
Why it matters?
This is important because it unlocks a new level of creative control for video creation. Imagine being able to easily adjust the timing of an action scene or change the camera angle to focus on a specific detail, all without having to re-record everything. This could be useful for filmmakers, game developers, or anyone who wants to create high-quality videos with precise control over every aspect.
Abstract
Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained, independent manipulation of each. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/
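The "4D positional encoding in the attention layer" can be illustrated with a rotary-style scheme. The abstract does not specify the exact form, so the sketch below, which splits each token's feature vector into four segments and rotates each segment by angles derived from one coordinate (world-time t plus camera x, y, z), is a hypothetical construction in the spirit of rotary positional embeddings, not the paper's definitive design.

```python
import math

def apply_4d_rope(token_vec, coords, base=10000.0):
    """Hypothetical 4D rotary encoding: split the feature vector into 4
    chunks and rotate consecutive pairs in each chunk by angles derived
    from one coordinate of `coords` = (world-time t, camera x, y, z)."""
    assert len(token_vec) % 4 == 0 and len(coords) == 4
    chunk = len(token_vec) // 4
    out = []
    for axis, value in enumerate(coords):
        seg = token_vec[axis * chunk:(axis + 1) * chunk]
        rotated = list(seg)
        for i in range(0, chunk - 1, 2):
            # One frequency per rotation pair within the chunk.
            freq = 1.0 / (base ** (i / chunk))
            angle = value * freq
            x, y = seg[i], seg[i + 1]
            rotated[i] = x * math.cos(angle) - y * math.sin(angle)
            rotated[i + 1] = x * math.sin(angle) + y * math.cos(angle)
        out += rotated
    return out
```

Because each segment is only rotated, token norms are preserved, and the attention dot product between a query and a key depends on their relative (t, x, y, z) offsets per axis. That per-axis structure is what would let the model respond to a change in world-time without disturbing its response to the camera trajectory, and vice versa.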