Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
2026-03-04
Summary
This paper focuses on making video editing easier and more precise by using both text instructions *and* visual examples. It tackles the challenge of needing lots of good examples to train these kinds of editing systems.
What's the problem?
Currently, telling a computer to edit a video using just words is hard because language isn't always specific enough to describe exactly what you want visually. While showing the computer a 'before' and 'after' example (reference-guided editing) works better, getting enough of these paired examples to train the system is a major roadblock. There just isn't enough high-quality data available.
What's the solution?
The researchers created a way to automatically generate more training data. They use AI image generators to create new 'before' examples that match existing 'after' examples, effectively expanding the dataset. They then built a new dataset called RefVIE using this method, along with a testing set called RefVIE-Bench. Finally, they designed a new video editing model, Kiwi-Edit, that's really good at understanding both the text instructions and the visual references, and they trained it using a smart, step-by-step approach.
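The data-expansion idea above can be sketched as turning each existing (video, instruction, edited-video) pair into a four-part training sample that adds a synthesized visual reference. This is a minimal illustrative sketch, not the paper's actual code: all class, function, and file names here are hypothetical, and the image-generation step is replaced by a placeholder.

```python
from dataclasses import dataclass

@dataclass
class EditPair:
    source_video: str    # path to the original ('before') video
    instruction: str     # text edit instruction
    edited_video: str    # path to the edited ('after') video

@dataclass
class EditQuadruplet:
    source_video: str
    instruction: str
    reference_image: str  # synthesized visual reference ('scaffold')
    edited_video: str

def synthesize_reference(pair: EditPair) -> str:
    """Stand-in for an image generator that produces a reference image
    consistent with the edited result; here it only derives a path."""
    return pair.edited_video.rsplit(".", 1)[0] + "_ref.png"

def expand_to_quadruplet(pair: EditPair) -> EditQuadruplet:
    # One existing editing pair becomes one instruction+reference sample.
    return EditQuadruplet(
        source_video=pair.source_video,
        instruction=pair.instruction,
        reference_image=synthesize_reference(pair),
        edited_video=pair.edited_video,
    )

pair = EditPair("clip01.mp4", "replace the car with a red bicycle", "clip01_edit.mp4")
quad = expand_to_quadruplet(pair)
print(quad.reference_image)  # → clip01_edit_ref.png
```

In practice the placeholder step would call an image generative model, but the key point the sketch conveys is structural: no new videos need to be collected, since each quadruplet reuses an existing editing pair.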
Why it matters?
This work is important because it significantly improves the ability to control video editing with precision. By overcoming the data shortage problem and creating a better model, they've set a new standard for how well computers can follow instructions and visual guidance when editing videos, opening the door for more user-friendly and powerful video editing tools.
Abstract
Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.