SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang
2026-01-01
Summary
This paper introduces SpaceTimePilot, a video diffusion model that takes an ordinary single-camera video and re-renders it so you can change both the camera viewpoint *and* the timing of the motion in the scene, independently, all while keeping the result realistic.
What's the problem?
Normally, if you want to change how a video looks (the camera's position, or what is happening in the scene) you need training data that already shows those changes. Collecting such data is hard, especially for *continuous* changes in both space and time: no existing dataset captures the same dynamic scene with many different camera paths and motion timings. As a result, existing methods struggle to control camera movement and the action in the video independently and at the same time.
What's the solution?
The researchers tackled this by building a model that separates control over space (where the camera is) from control over time (which moment of the action each output frame shows). They do this with an 'animation time-embedding', which tells the diffusion model, frame by frame, which point in the source video's motion to render. Because no dataset provides the paired videos this requires, they cleverly 'warped' existing multi-view datasets so they mimic videos with different timings, and they also built a new synthetic dataset, CamxTime, whose videos cover completely free movement in both space and time. Finally, they improved how camera information is fed to the model so the viewpoint can be changed from the very first frame. A rough sketch of the time-embedding idea is given below.
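To make the 'animation time-embedding' idea concrete, here is a minimal sketch. It assumes (our assumption, not the paper's released code) that each output frame is tagged with a scalar animation time in [0, 1] relative to the source video, encoded with a standard sinusoidal embedding and mapped to a per-frame conditioning vector for the diffusion backbone. All class names, dimensions, and the freeze-time example are hypothetical.

```python
# Hedged sketch of an animation time-embedding (illustrative, not the
# authors' implementation). Each output frame gets a scalar animation time
# in [0, 1]; a sinusoidal encoding plus a small MLP turns it into a
# per-frame conditioning vector for the video diffusion backbone.
import math
import torch


def sinusoidal_embedding(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Encode animation times t (shape [num_frames]) as sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None] * freqs[None, :]          # [num_frames, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


class AnimationTimeEmbedding(torch.nn.Module):
    """Maps per-frame animation times to conditioning vectors (hypothetical)."""

    def __init__(self, dim: int = 256, cond_dim: int = 1024):
        super().__init__()
        self.dim = dim
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, cond_dim),
            torch.nn.SiLU(),
            torch.nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, anim_time: torch.Tensor) -> torch.Tensor:
        # anim_time: [num_frames], e.g. torch.linspace(0, 1, 16) to replay
        # the source motion, or a constant value to "freeze" time while a
        # separate camera condition sweeps the viewpoint.
        return self.mlp(sinusoidal_embedding(anim_time, self.dim))


# Example: freeze scene motion at the source video's midpoint for all
# 16 output frames.
embed = AnimationTimeEmbedding()
frozen_time = torch.full((16,), 0.5)
cond = embed(frozen_time)                       # [16, 1024] per-frame condition
```

Under this sketch, varying the animation times while holding the camera condition fixed changes only the action, and vice versa, which is the kind of disentangled control the paper describes.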
Why it matters?
This work is important because it makes it much easier to create and edit videos with AI. Imagine taking a single ordinary video and then virtually 'flying' through the scene from new viewpoints, or slowing, freezing, and replaying the action, all without needing a ton of pre-recorded footage. This could be useful for visual effects, virtual reality experiences, or simply making fun, personalized videos.
Abstract
We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
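The temporal-warping training scheme mentioned in the abstract can be illustrated with a short, hedged sketch: assuming (our assumption, not the released pipeline) that a warp over the frame indices of an existing clip yields a target clip together with the per-frame animation times that label it, supplying supervision for temporal control without paired space-time captures. The function names and specific warp modes below are illustrative only.

```python
# Hedged sketch of building temporal-warping training pairs (illustrative,
# not the paper's data pipeline). A warp over frame indices (slow-down,
# freeze, reverse, ...) produces a target clip plus the per-frame animation
# times used as the temporal-control label.
import numpy as np


def warp_times(num_out: int, mode: str = "slow") -> np.ndarray:
    """Return fractional source-time values in [0, 1] for each output frame."""
    u = np.linspace(0.0, 1.0, num_out)
    if mode == "slow":          # play the first half of the motion, slowly
        return 0.5 * u
    if mode == "freeze":        # hold a single instant while the camera moves
        return np.full(num_out, 0.5)
    if mode == "reverse":       # play the motion backwards
        return 1.0 - u
    return u                    # identity: same timing as the source


def make_training_pair(frames: np.ndarray, mode: str):
    """frames: [T, H, W, 3] source clip. Returns (target clip, animation times)."""
    num_src = frames.shape[0]
    anim_time = warp_times(num_out=num_src, mode=mode)
    # Nearest-frame resampling; a real pipeline might interpolate instead.
    idx = np.clip(np.round(anim_time * (num_src - 1)).astype(int), 0, num_src - 1)
    return frames[idx], anim_time


# Example with dummy data: a 16-frame clip warped to a frozen-time target.
clip = np.zeros((16, 64, 64, 3), dtype=np.uint8)
target, times = make_training_pair(clip, mode="freeze")
print(target.shape, times[:4])   # (16, 64, 64, 3) [0.5 0.5 0.5 0.5]
```

In this sketch the warped clip serves as the training target and the animation times as its temporal label, so the model learns to reproduce a specified retiming of the source motion.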