ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling

2025-10-07

Summary

This paper introduces ChronoEdit, a new method for editing images that focuses on making sure the changes look realistic and follow the laws of physics, like how objects move and interact in the real world.

What's the problem?

Current image editing tools, even the most advanced AI-based ones, often struggle with physical consistency. If you change something in an image, such as moving an object, the result can look unnatural or even impossible, because the edit doesn't account for how things actually behave in the real world. This is a serious limitation for applications like realistic world simulation.

What's the solution?

ChronoEdit tackles this by thinking of image editing as creating a short video. It treats the original image and the edited image as the first and last frames of a video, which lets it leverage powerful AI models pretrained on video, models that have learned not just what objects look like but how they move and interact, to 'imagine' how the change unfolds over time. This forces the edit to be physically plausible. Importantly, it never renders the full video: during a brief 'temporal reasoning' stage, the edited frame is denoised together with intermediate reasoning frames just long enough to lock in a realistic trajectory for the change, after which those reasoning frames are dropped, saving a lot of computing power.
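To make the two-stage idea concrete, here is a minimal sketch in PyTorch of what such an inference loop could look like. Everything here is an illustrative assumption rather than ChronoEdit's released code: the function names, the stand-in `denoiser` callable, the simple Euler-style update, and the specific step counts are all hypothetical.

```python
import torch

def chrono_edit_sketch(denoiser, input_frame, prompt,
                       num_steps=50, reasoning_steps=10,
                       num_reasoning_frames=6):
    """Hypothetical two-stage editing loop in the spirit of ChronoEdit.

    `denoiser(frames, t, prompt)` is a stand-in for a pretrained video
    diffusion model: it takes a stack of frames at noise level t and
    returns a noise prediction of the same shape.
    """
    # The edited (target) frame starts as pure noise; the input frame
    # stays clean and acts as conditioning throughout.
    target = torch.randn_like(input_frame)
    # "Reasoning tokens": noisy intermediate frames between input and target.
    reasoning = torch.randn(num_reasoning_frames, *input_frame.shape)

    for step in range(num_steps):
        t = 1.0 - step / num_steps  # noise level, annealed from 1 to 0
        if step < reasoning_steps:
            # Stage 1: joint denoising. The video model sees
            # [input, reasoning..., target] as one clip, so the target edit
            # must be consistent with a physically plausible trajectory.
            clip = torch.cat([input_frame[None], reasoning, target[None]], dim=0)
            noise_pred = denoiser(clip, t, prompt)
            reasoning = reasoning - noise_pred[1:-1] / num_steps
            target = target - noise_pred[-1] / num_steps
        else:
            # Stage 2: the reasoning frames are dropped; only the target
            # frame keeps being denoised, avoiding the cost of rendering
            # the full video.
            clip = torch.stack([input_frame, target], dim=0)
            noise_pred = denoiser(clip, t, prompt)
            target = target - noise_pred[-1] / num_steps
    return target

# Usage (with a real video diffusion model supplied as `denoiser`):
# edited = chrono_edit_sketch(model, image_tensor, "pick up the mug")
```

The key design point this sketch tries to capture is the asymmetry between the two stages: the intermediate frames are only denoised for the first few steps, just enough to constrain the edit to a physically viable path, and the remaining budget is spent solely on the single frame the user actually wants.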

Why it matters?

This work is important because it significantly improves the realism of image editing, especially when changes involve how objects interact with the world. It opens the door to creating more believable simulations and virtual environments, and provides a new benchmark, PBench-Edit, to measure progress in this area. The tools developed will be publicly available for others to use and build upon.

Abstract

Recent advances in large generative models have significantly advanced image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world-simulation-related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit