Region-Constraint In-Context Generation for Instructional Video Editing
Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei
2025-12-23
Summary
This paper introduces a new method, called ReCo, for editing videos based on text instructions. It focuses on making the editing process more accurate and preventing unwanted changes to parts of the video that *shouldn't* be edited.
What's the problem?
When you try to edit a video using only text instructions, the model can get confused about *where* to make the changes. It might edit the wrong areas, or the edits might accidentally spill over into parts of the video that should stay untouched. This happens because the system struggles to cleanly separate what needs to change from what should stay the same, and different parts of the video can interfere with each other during the editing process.
What's the solution?
ReCo solves this by processing the original and edited videos *together* during the editing process, using two main techniques to guide the edit. First, it encourages the edited regions of the video to differ clearly from the original while keeping the non-edited regions similar. Second, it prevents the edited region from 'paying attention' to the corresponding region of the original video, which reduces unwanted interference. To train the system, the researchers also built a large dataset of videos paired with editing instructions.
Why it matters?
This research is important because it makes instructional video editing much more reliable and effective. Being able to easily edit videos with simple text commands has a lot of potential for things like creating content, special effects, and even helping people with video editing who don't have a lot of technical skills. Improving the accuracy and control of this process is a big step forward.
Abstract
The in-context generation paradigm has recently demonstrated strong power in instructional image editing, with both data efficiency and high synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, applied to one-step backward-denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing the modification of the editing area and alleviating unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to the tokens in the counterpart region of the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
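The two regularization terms described in the abstract can be sketched as simple mask-based losses. The sketch below is an illustrative reconstruction, not the paper's actual implementation: the tensor shapes, mask conventions, and function names are all assumptions.

```python
# Hedged sketch of ReCo-style regularization, assuming a binary editing-region
# mask (1 = editing region, 0 = non-editing region). Not the paper's code.
import torch

def latent_regularization(z_src, z_tgt, mask):
    """Latent regularization (sketch): push editing-region latents of the
    target away from the source, pull non-editing latents together.

    z_src, z_tgt: one-step backward-denoised latents, shape (B, C, T, H, W)
    mask:         editing-region mask, shape (B, 1, T, H, W), values in {0, 1}
    """
    diff = (z_tgt - z_src).pow(2)  # per-element squared latent gap
    edit_gap = (diff * mask).sum() / mask.sum().clamp(min=1)
    keep_gap = (diff * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
    # Small gap outside the edit region is rewarded; large gap inside it too.
    return keep_gap - edit_gap

def attention_regularization(attn, mask_q, mask_k_src):
    """Attention regularization (sketch): penalize attention mass flowing from
    target editing-region queries to the counterpart source-region keys.

    attn:       attention map, shape (B, heads, Q, K), rows sum to 1
    mask_q:     1 for queries in the target editing region, shape (B, 1, Q, 1)
    mask_k_src: 1 for keys in the source counterpart region, shape (B, 1, 1, K)
    """
    leaked = attn * mask_q * mask_k_src  # cross-region attention mass
    return leaked.sum() / mask_q.sum().clamp(min=1)
```

Both terms would be added to the usual diffusion training loss with some weighting; the exact weights and where the one-step denoised latents are computed are details the abstract does not specify.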