
Unified Video Editing with Temporal Reasoner

Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu

2025-12-09


Summary

This paper introduces VideoCoF, a Chain-of-Frames method for instruction-based video editing that aims to combine the precision of mask-based expert models with the flexibility of unified, mask-free approaches.

What's the problem?

Currently, video editing AI faces a dilemma. Highly accurate editing requires specific information from the user, like exactly which parts of the video to change (masks), which makes it hard to create a single system that can do many different edits. On the other hand, simpler methods that don't need masks aren't very precise and struggle to understand what part of the video the instructions refer to, leading to inaccurate edits.

What's the solution?

VideoCoF solves this with a 'see, reason, then edit' procedure. The model is first asked to reason about *which parts* of the video need to change, forming an internal plan before actually making the edit. This reasoning step predicts special tokens (edit-region latents) that represent the areas to be modified. The authors also add a RoPE alignment technique that keeps motion consistent over time and lets the model edit videos longer than those it was trained on. With only 50,000 example video pairs, a relatively small amount of training data, the method achieves strong results. A minimal sketch of the 'see, reason, then edit' ordering appears below.
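To make the ordering concrete, here is an illustrative sketch of the 'see, reason, then edit' idea. The module and helper names (predict_reasoning, generate_target) are assumptions for illustration, not the released VideoCoF API (see https://github.com/knightyxp/VideoCoF for the actual code).

```python
import torch

def see_reason_edit(model, source_latents, instruction_emb):
    """Conceptual 'see, reason, then edit' ordering (illustrative only).

    The diffusion model first predicts 'reasoning tokens' (latents describing
    the edit region) and only then generates the target video tokens,
    conditioning the edit on its own predicted region instead of a user mask.
    """
    # 1. "See": combine the encoded source video with the instruction embedding.
    context = torch.cat([source_latents, instruction_emb], dim=1)

    # 2. "Reason": predict edit-region latents as reasoning tokens
    #    (hypothetical helper standing in for the reasoning-token prediction step).
    reason_tokens = model.predict_reasoning(context)

    # 3. "Edit": generate the target video tokens conditioned on the source,
    #    the instruction, and the predicted edit region (hypothetical helper).
    target_tokens = model.generate_target(context, reason_tokens)
    return reason_tokens, target_tokens
```

Ordering the reasoning tokens before the target tokens is what gives the model an explicit spatial plan, so the instruction can be grounded to a region without any user-provided mask.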

Why it matters?

This research is important because it allows for more precise and versatile video editing AI. By removing the need for users to manually specify edit regions, it makes video editing much easier and more accessible, while still achieving high-quality results. It’s a step towards AI that can understand and manipulate video content in a more intelligent and intuitive way.

Abstract

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.