Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, Or Litany

2025-11-13

Summary

This paper introduces Time-to-Move (TTM), a method for generating videos with precisely controlled motion and appearance using existing AI video generators, without any extra training.

What's the problem?

Current AI video generators produce realistic visuals, but they offer little precise control over *how* things move in the video. Existing methods for motion control require fine-tuning the model for each specific setup, which is computationally expensive and limits what you can do. Simply describing the desired motion with text, or providing a starting image, isn't enough for detailed control.

What's the solution?

The researchers developed TTM, which requires no additional training of the AI model. Instead, you give it a rough animation showing the desired motion, something you could create with simple tools like cut-and-drag object manipulation or depth-based reprojection. TTM uses this rough animation as a motion guide while letting the AI fill in the details so the movement looks natural. The key technique is 'dual-clock denoising': the model is held closely to your rough animation in the regions where you specified motion, but given more freedom everywhere else. Because this only modifies the sampling process, it works with any existing image-to-video generator without changing the generator itself.
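To make the idea concrete, here is a minimal toy sketch of region-dependent ("dual-clock") denoising in NumPy. This is an illustrative assumption, not the paper's implementation: `denoise_step`, the thresholds `t_strong`/`t_weak`, and the linear noise schedule are all hypothetical stand-ins. The core idea shown is that the motion-specified (masked) region stays anchored to the noised crude animation for longer than the rest of the frame.

```python
import numpy as np

def add_noise(x, t, rng):
    """Forward-diffusion stand-in: mix signal with Gaussian noise at level t in [0, 1]."""
    return np.sqrt(1.0 - t) * x + np.sqrt(t) * rng.standard_normal(x.shape)

def dual_clock_sample(denoise_step, crude_video, motion_mask,
                      t_strong=0.3, t_weak=0.7, num_steps=50, seed=0):
    """Toy sketch of dual-clock denoising.

    denoise_step(x, t) is a placeholder for one backbone denoising step.
    While t > t_weak, the whole frame is overwritten with the noised crude
    animation; while t > t_strong, only the motion-specified (masked)
    region is; below t_strong the sampler runs freely everywhere.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(crude_video.shape)  # start from pure noise
    for t in np.linspace(1.0, 0.0, num_steps, endpoint=False):
        x = denoise_step(x, t)
        ref = add_noise(crude_video, t, rng)
        if t > t_weak:                      # early steps: anchor everything
            x = ref
        elif t > t_strong:                  # mid steps: anchor masked region only
            x = motion_mask * ref + (1.0 - motion_mask) * x
    return x
```

The two thresholds act as the two "clocks": the masked region effectively starts its free denoising later (stronger alignment with the user's motion), while the unmasked region is released earlier (more natural dynamics).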

Why it matters?

This is important because it makes it much easier to create videos with exactly the movements you want, without specialized training or heavy compute. It also enables very precise control over appearance, down to individual pixels, which is difficult to achieve with text prompts alone. Together, these open up possibilities for more customized and realistic AI-generated video.

Abstract

Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.