ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo
2025-12-12
Summary
This paper focuses on improving how well video editing models understand and follow instructions that require reasoning about the real world, like physics or cause and effect.
What's the problem?
Current video editing models, even those good at understanding both images and language, struggle when asked to edit videos in a way that makes logical sense. This is because the datasets used to train these models aren't good enough for testing reasoning skills, and the models themselves don't naturally connect their understanding of a situation with the actual process of making changes to the video. They can *understand* what needs to happen, but not *make* it happen visually.
What's the solution?
The researchers created a new task called Reason-Informed Video Editing (RVE) that specifically tests a model’s ability to reason about how things work in the real world while editing videos. They also built a dataset, RVE-Bench, to evaluate these skills. To solve the problem, they developed a model called ReViSE, which uses a 'self-reflection' process. Essentially, the model checks its own work to see if the edited video actually makes sense according to the instructions, and then uses that feedback to improve its editing process. It’s like the model is constantly asking itself, 'Does this change actually fit with what I'm trying to achieve?'
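The self-reflection loop described above can be illustrated with a toy sketch. This is not the paper's actual implementation: the real system uses an internal vision-language model to judge edited videos, whereas here the generator, the evaluator, and the update rule are all simplified stand-ins invented for illustration.

```python
# Illustrative sketch of a self-reflective training loop (a toy stand-in,
# NOT ReViSE's actual method): the generator proposes an edit, an internal
# evaluator scores whether the result satisfies the instruction, and that
# score is fed back to refine the generator.
from dataclasses import dataclass


@dataclass
class ToyGenerator:
    """Stand-in for the editing model; 'skill' abstracts its parameters."""
    skill: float = 0.2

    def edit(self, video: str, instruction: str) -> str:
        # Produce a (symbolic) edited video conditioned on the instruction.
        return f"{video} edited per '{instruction}' (skill={self.skill:.2f})"

    def update(self, reward: float, lr: float = 0.1) -> None:
        # Nudge the generator toward edits the evaluator rates highly.
        self.skill = min(1.0, self.skill + lr * reward)


def toy_vlm_score(edited: str, instruction: str, skill: float) -> float:
    """Stand-in for the internal VLM judging logical consistency in [0, 1]."""
    # Toy rule: a more skilled generator satisfies the instruction more often.
    return skill


def self_reflective_step(gen: ToyGenerator, video: str, instruction: str) -> float:
    edited = gen.edit(video, instruction)
    score = toy_vlm_score(edited, instruction, gen.skill)
    gen.update(reward=score)  # intrinsic feedback refines the generator
    return score


gen = ToyGenerator()
for _ in range(20):
    self_reflective_step(gen, "clip.mp4", "melt the ice cube over time")
print(round(gen.skill, 2))  # skill rises from 0.2 as feedback accumulates
```

The key design point the sketch mirrors is that the evaluator and generator live in one loop: the model's own judgment of "does this edit make sense?" becomes the training signal, rather than relying solely on external labels.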
Why it matters?
This work is important because it pushes video editing models beyond simply making visual changes and towards actually *understanding* what those changes mean. This is a crucial step towards creating AI that can realistically manipulate videos and create content that is both visually appealing and logically consistent, which has applications in areas like filmmaking, special effects, and even creating educational materials.
Abstract
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.