Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
2025-10-20
Summary
This paper introduces a new system called Ditto, designed to make it easier to edit videos using simple text instructions, like those you would give a human video editor. The authors also created a huge dataset of video editing examples to train their system.
What's the problem?
Currently, creating AI that can edit videos based on text instructions is difficult because there isn't enough good-quality data to train these models. Building such data is expensive and time-consuming, and existing methods don't produce diverse or high-quality results. Existing approaches face a three-way trade-off between cost, quality, and temporal consistency, that is, making sure the edits look smooth from frame to frame.
What's the solution?
The researchers built a system that automatically generates video editing examples. It combines a powerful image editor, which edits a reference frame, with an in-context video generator that propagates the edit across the whole clip, and it uses an intelligent agent to craft varied instructions and filter out bad results. To keep the process affordable, they use a smaller, distilled model augmented with a temporal enhancer that improves frame-to-frame consistency. With this system they created Ditto-1M, a dataset of one million video editing examples, and then trained a model, Editto, on this data.
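The generation pipeline described above can be sketched in code. This is a hypothetical outline only: the component names, call signatures, and frame representation below are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of the Ditto-style data-generation pipeline.
# All component names and signatures are placeholders, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Frame = str  # stand-in for an image tensor


@dataclass
class PipelineComponents:
    write_instruction: Callable[[Sequence[Frame]], str]            # agent: crafts a varied instruction
    edit_image: Callable[[Frame, str], Frame]                      # image editor: edits a keyframe
    propagate: Callable[[Sequence[Frame], Frame, str], List[Frame]]  # in-context video generator
    refine: Callable[[List[Frame]], List[Frame]]                   # temporal enhancer
    accept: Callable[[Sequence[Frame], List[Frame], str], bool]    # agent-driven quality filter


def generate_example(video: Sequence[Frame], c: PipelineComponents
                     ) -> Optional[Tuple[Sequence[Frame], str, List[Frame]]]:
    """Produce one (source video, instruction, edited video) training triplet."""
    instruction = c.write_instruction(video)              # 1. agent writes an instruction
    edited_key = c.edit_image(video[0], instruction)      # 2. apply the edit to a reference keyframe
    edited = c.propagate(video, edited_key, instruction)  # 3. propagate the edit across all frames
    edited = c.refine(edited)                             # 4. improve temporal coherence
    if not c.accept(video, edited, instruction):          # 5. reject low-quality samples
        return None
    return (video, instruction, edited)
```

Repeating this loop over a large video corpus, and keeping only accepted triplets, is what would scale such a pipeline to a million examples.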
Why it matters?
This work is important because it addresses a major roadblock in making video editing accessible to everyone. By creating a large, high-quality dataset and an efficient training system, the authors significantly improve the ability of AI to follow video editing instructions, paving the way for easier and more powerful video creation tools.
Abstract
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
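The abstract mentions training Editto with a curriculum learning strategy but gives no details. As a generic illustration of that technique, the sketch below orders samples from easy to hard and widens the training pool stage by stage; the difficulty scoring and schedule are assumptions, not the paper's actual curriculum.

```python
# Generic curriculum learning sketch: the staging scheme and difficulty
# function are illustrative, not taken from the paper.
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(samples: Sequence[T],
                      difficulty: Callable[[T], float],
                      num_stages: int) -> List[List[T]]:
    """Sort samples easy-to-hard and expose a progressively larger subset per stage."""
    ordered = sorted(samples, key=difficulty)
    stages = []
    for stage in range(1, num_stages + 1):
        cutoff = round(len(ordered) * stage / num_stages)  # widen the pool each stage
        stages.append(list(ordered[:cutoff]))
    return stages
```

Training then iterates over the stages in order, so the model sees simple edits first and the hardest ones only in later stages.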