UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, Shengyu Zhang
2025-12-03
Summary
This paper addresses the widening capability gap between image editing models built by large companies (like those behind GPT-4o) and those developed by the open-source community. The core issue is that open-source models lack the huge amounts of high-quality training data needed to compete.
What's the problem?
Creating enough training data for these image editing models is hard. Having people manually label everything is accurate but slow and doesn't scale. Generating the data automatically is faster, but errors compound through the pipeline and make the data unreliable. Essentially, you can have a lot of data that isn't very good, or a little data that's really good, but not both.
What's the solution?
The researchers built a new system to automatically create a large dataset of image editing examples. A powerful end-to-end AI model generates the edits, and a second, smaller 7B model, called Qwen-Verify, then checks each result: it flags failed edits and rewrites ("recaptions") the editing instructions so they match what the edit actually did. This produced UnicEdit-10M, a dataset of 10 million examples. They also created a new benchmark, UnicBench, designed to push these models and reveal where they struggle, particularly on spatial relationships and general-knowledge reasoning.
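The generate-then-verify flow described above can be sketched in a few lines. Everything here is a stand-in: the three model calls are hypothetical placeholders, and whether failed samples are recaptioned or simply discarded is an assumption; only the overall loop (generate an edit, detect failures, recaption the instruction) follows the description.

```python
def generate_edit(image, instruction):
    # Stand-in for the end-to-end editing model (not the paper's API).
    return f"edited({image})"

def verify(image, edited, instruction):
    # Stand-in for Qwen-Verify's failure-detection task: returns True
    # if the edit matches the instruction. Toy rule for illustration only.
    return "remove" not in instruction

def recaption(image, edited, instruction):
    # Stand-in for Qwen-Verify's recaptioning task: rewrite the
    # instruction to describe what the edit actually did.
    return instruction + " (recaptioned)"

def build_dataset(samples):
    dataset = []
    for image, instruction in samples:
        edited = generate_edit(image, instruction)
        if verify(image, edited, instruction):
            dataset.append((image, edited, instruction))
        else:
            # Failed verification: salvage the pair by recaptioning
            # instead of discarding it (an assumption about the pipeline).
            dataset.append((image, edited, recaption(image, edited, instruction)))
    return dataset

pairs = [("img1.png", "add a red hat"), ("img2.png", "remove the car")]
print(build_dataset(pairs))
```

The point of the single post-verification stage is that one dual-task checker replaces a chain of separate tools, so there is only one place where quality control can fail.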
Why it matters?
This work is important because it provides the open-source community with a large, high-quality dataset and a challenging benchmark to improve their image editing models. By identifying the specific weaknesses of current models, it guides future research and helps close the performance gap with closed-source alternatives, ultimately making better image editing technology available to everyone.
Abstract
With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.