EasyV2V: A High-quality Instruction-based Video Editing Framework

Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei

2025-12-19

Summary

This paper introduces EasyV2V, a new and straightforward method for editing videos based on text instructions, aiming to make video editing as easy and versatile as image editing.

What's the problem?

While we've gotten really good at editing single images with AI, editing videos is much harder. It's tough to make changes that look consistent throughout the video, to have precise control over *what* gets changed, and to create a system that works well with lots of different kinds of videos. Existing video editing tools often fall short in these areas.

What's the solution?

The researchers tackled this by improving three key areas. First, they built better training datasets by combining existing resources and adding new ones, including transition supervision that teaches the model how edits unfold over time. Second, they observed that powerful pretrained text-to-video models already possess some editing ability, so instead of building something completely new they simply attach the conditioning inputs to the model's input sequence and lightly fine-tune it with LoRA. Finally, they developed a unified way to control edits using a single mask mechanism (to specify *where* and *when* to edit) and optional reference images, allowing for flexible input like just a video and text, or a video, text, mask, and reference image.
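The core architectural idea, "sequence concatenation for conditioning", can be pictured as flattening the noisy video, the source video, the mask, and an optional reference image into tokens and joining them along one sequence axis before feeding them to the model. The sketch below is illustrative only; the function name, shapes, and token layout are assumptions, not the paper's actual code.

```python
import numpy as np

def build_conditioned_sequence(noisy_video, source_video, mask, reference=None):
    """Hypothetical sketch: flatten each (frames, height, width, channels)
    array into tokens and concatenate along the sequence axis, mimicking
    sequence-concatenation conditioning for a video diffusion model."""
    def to_tokens(x):
        f, h, w, c = x.shape
        return x.reshape(f * h * w, c)

    parts = [to_tokens(noisy_video), to_tokens(source_video)]
    # The mask marks *where* (spatially) and *when* (temporally) to edit;
    # broadcast it to the channel width so it joins the same token sequence.
    parts.append(to_tokens(np.broadcast_to(mask[..., None], source_video.shape).copy()))
    if reference is not None:
        # Optional reference image, treated as a single extra frame.
        parts.append(to_tokens(reference[None]))
    return np.concatenate(parts, axis=0)

# Toy shapes: 4 frames of 8x8 latents with 16 channels.
F, H, W, C = 4, 8, 8, 16
noisy = np.random.randn(F, H, W, C)
src = np.random.randn(F, H, W, C)
mask = np.zeros((F, H, W))
mask[:, 2:6, 2:6] = 1.0  # edit only this spatial region, in every frame
ref = np.random.randn(H, W, C)

seq = build_conditioned_sequence(noisy, src, mask, ref)
print(seq.shape)  # (F*H*W*3 + H*W, C) = (832, 16)
```

Because every conditioning signal becomes ordinary tokens in the same sequence, the pretrained model's attention layers can use them without any architectural change, which is what makes light fine-tuning sufficient.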

Why it matters?

EasyV2V is a big step forward because it achieves better video editing results than current methods, including commercial software. It makes video editing more accessible and controllable, potentially opening up new possibilities for content creation and manipulation. It shows that we can leverage existing AI models to perform complex tasks like video editing with relatively little extra training.
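The "relatively little extra training" comes from LoRA (Low-Rank Adaptation), which freezes the pretrained weights and learns only a small low-rank update. A minimal numpy sketch of that idea, with illustrative names and dimensions not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 32, 32, 4  # toy sizes; real models use much larger d

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path x @ W.T plus a low-rank learned delta x @ (B @ A).T.
    # With B zero-initialized, the output starts identical to the
    # pretrained model, so fine-tuning begins from its full capability.
    return x @ W.T + x @ A.T @ B.T

x = rng.standard_normal((1, d_in))
assert np.allclose(lora_forward(x), x @ W.T)  # no change at initialization
```

Only A and B are trained, so the number of new parameters is `rank * (d_in + d_out)` per adapted layer rather than `d_in * d_out`, which is why adapting a large text-to-video model for editing stays cheap.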

Abstract

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/