Key Features

Training-free and plug-and-play framework
Adds precise motion control to existing video diffusion models
Uses crude reference animations as coarse motion cues
Adapts the mechanism of SDEdit to the video domain
Enables joint control over both motion and appearance
Preserves input details and faithfully follows the specified motion
Generates realistic videos without extra training or architectural changes
Flexible approach yields realistic dynamics without artifacts

Time-to-Move (TTM) takes an input image and a user-specified motion, then automatically builds a coarse warped reference video and a mask marking the controlled region. The image-to-video diffusion model is conditioned on the clean input image and initialized from a noised version of the warped reference, anchoring appearance while injecting the intended motion. During sampling, dual-clock denoising applies strong anchoring inside the masked region to enforce the commanded motion and weaker anchoring elsewhere to allow natural dynamics.
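
To make the dual-clock idea concrete, here is a minimal sketch of region-dependent anchoring during sampling. The interfaces (`denoise_step`, `add_noise`), the cutoff parameters, and the latent layout are illustrative assumptions, not TTM's actual implementation.

```python
# Illustrative sketch of dual-clock denoising; not the authors' implementation.
# Assumed interfaces: denoise_step(x, t) performs one reverse-diffusion step of
# the image-to-video model (conditioned on the clean input image), and
# add_noise(x0, t) corrupts a clean latent to noise level t.
from typing import Callable, Sequence
import torch

def dual_clock_sample(
    denoise_step: Callable[[torch.Tensor, int], torch.Tensor],
    add_noise: Callable[[torch.Tensor, int], torch.Tensor],
    timesteps: Sequence[int],   # descending, from high noise to low noise
    warped_ref: torch.Tensor,   # latent of the coarse warped reference video
    mask: torch.Tensor,         # 1 inside the motion-controlled region, 0 outside
    t_inside: int,              # masked region stays anchored while t > t_inside
    t_outside: int,             # background stays anchored while t > t_outside
) -> torch.Tensor:
    """SDEdit-style sampling with two region-dependent anchoring cutoffs.

    Choosing t_inside < t_outside keeps the masked region locked to the warped
    reference for more steps (enforcing the commanded motion) while releasing
    the background earlier (allowing natural dynamics).
    """
    # Initialize from a noised copy of the warped reference, as in SDEdit.
    x = add_noise(warped_ref, timesteps[0])
    for t in timesteps:
        x = denoise_step(x, t)
        # Re-inject the noised reference only in regions whose "clock" has
        # not yet passed its cutoff; elsewhere keep the model's own sample.
        ref_t = add_noise(warped_ref, t)
        keep_in = 1.0 if t > t_inside else 0.0
        keep_out = 1.0 if t > t_outside else 0.0
        anchor = mask * keep_in + (1 - mask) * keep_out
        x = anchor * ref_t + (1 - anchor) * x
    return x
```

In this sketch the two cutoffs are the only extra knobs: shrinking `t_inside` tightens adherence to the reference motion, while growing `t_outside` frees the background sooner for more natural dynamics.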


TTM enables joint control over both motion and appearance, allowing new objects to be inserted from outside the original image and an existing object's appearance to be modified. Experiments show that TTM matches or surpasses training-based baselines in both realism and motion fidelity. This flexible approach yields realistic dynamics without artifacts, making it a powerful tool for video generation and manipulation.
