TTM takes an input image and a user-specified motion, then automatically builds a coarse warped reference video and a mask marking the controlled region. The image-to-video diffusion model is conditioned on the clean input image and initialized from a noisy version of the warped reference, anchoring appearance while injecting the intended motion. During sampling, dual-clock denoising applies a stronger constraint toward the warped reference inside the masked region and a weaker one outside, enforcing the commanded motion while allowing the rest of the scene to develop natural dynamics.
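To make the sampling procedure concrete, the sketch below shows one way a dual-clock loop could be structured: the masked region stays tied to a re-noised copy of the warped reference for a longer portion of the denoising trajectory than the background. The `denoise_step` callable, the `scheduler` interface, the mask layout, and the two release timesteps are illustrative assumptions, not the released TTM implementation.

```python
import torch

@torch.no_grad()
def dual_clock_sampling(
    denoise_step,        # assumed callable: one reverse-diffusion step (x_t, t, cond) -> x_{t-1}
    scheduler,           # assumed DDPM-style: .timesteps (high -> low) and .add_noise(x0, noise, t)
    cond_image,          # clean input image that anchors appearance
    warped_ref,          # coarse warped reference video, shape (B, C, T, H, W)
    mask,                # 1 inside the user-controlled region, 0 elsewhere (broadcastable to warped_ref)
    release_masked=700,  # masked region follows the warp while t > release_masked (longer clock)
    release_bg=900,      # background follows the warp only while t > release_bg (shorter clock)
):
    """Hypothetical dual-clock sampler: the two regions are released from the
    warped reference at different noise levels."""
    # Initialize from a noisy version of the warped reference video.
    t0 = scheduler.timesteps[0]
    x = scheduler.add_noise(warped_ref, torch.randn_like(warped_ref), t0)

    for t in scheduler.timesteps:
        # Re-noise the warped reference to the current noise level and paste it
        # in on two different clocks: the controlled region tracks the warp for
        # longer, the background is released earlier so it can develop natural
        # dynamics on its own.
        ref_t = scheduler.add_noise(warped_ref, torch.randn_like(x), t)
        keep = mask * float(int(t) > release_masked) + (1 - mask) * float(int(t) > release_bg)
        x = keep * ref_t + (1 - keep) * x

        # One regular denoising step, conditioned on the clean input image.
        x = denoise_step(x, t, cond_image)

    return x
```

In this sketch the two "clocks" are simply two cutoff timesteps: a later cutoff for the background gives the model more freedom there, while the earlier cutoff inside the mask keeps the controlled region pinned to the warped reference long enough to commit to the commanded motion.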
Time-to-Move enables joint control over motion and appearance, allowing new objects to be inserted from outside the original image and the appearance of existing objects to be modified. Experiments show that TTM matches or surpasses training-based baselines in both realism and motion fidelity. The approach yields realistic dynamics without artifacts, making it a practical tool for video generation and manipulation.

