MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks
Mingsong Li, Lin Liu, Hongjun Wang, Haoxing Chen, Xijun Gu, Shizhan Liu, Dong Gong, Junbo Zhao, Zhenzhong Lan, Jianguo Li
2025-09-19
Summary
This paper introduces MultiEdit, a new large-scale dataset designed to help computers get better at editing images based on text instructions.
What's the problem?
Current methods for instruction-based image editing struggle with complex edits because the datasets they learn from cover only a limited range of edit types and contain too few examples. Many existing datasets also pair images with noisy or inaccurate captions, which can bias the model and make it harder to learn how to edit images correctly.
What's the solution?
The researchers created MultiEdit, a dataset with over 107,000 image-editing examples. It covers 6 challenging tasks across 18 non-style-transfer editing types and 38 style-transfer operations, ranging from style changes to complex semantic edits like editing text within an image or modifying a specific person or object. To build it, they used two multi-modal large language models (MLLMs): one generates editing instructions adapted to each image, and the other produces the corresponding high-fidelity edited images, keeping quality and consistency high.
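The two-stage pipeline described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the authors' actual code: the function names, prompts, and data fields are hypothetical placeholders standing in for real MLLM calls.

```python
# Hypothetical sketch of a two-stage MLLM dataset-construction pipeline:
# stage 1 proposes an instruction adapted to the source image, stage 2
# produces the edited image. Real MLLM calls are replaced by stubs.

def generate_instruction(source_image: str, edit_type: str) -> str:
    """Stage 1 (placeholder): an instruction-generation MLLM would
    inspect the source image and write a visual-adaptive instruction."""
    return f"Perform {edit_type} on the main subject of {source_image}"

def generate_edited_image(source_image: str, instruction: str) -> str:
    """Stage 2 (placeholder): an image-editing MLLM would render the
    edited image conditioned on the source image and instruction."""
    return f"{source_image} -> edited per '{instruction}'"

def build_sample(source_image: str, edit_type: str) -> dict:
    """Assemble one training sample: source, instruction, edited result."""
    instruction = generate_instruction(source_image, edit_type)
    edited = generate_edited_image(source_image, instruction)
    return {
        "source": source_image,
        "edit_type": edit_type,
        "instruction": instruction,
        "edited": edited,
    }

sample = build_sample("photo_001.png", "in-image text editing")
print(sample["instruction"])
```

In the actual paper, each stage is a separate multi-modal model; the split lets the instruction stage tailor edits to each image's content before any pixels are generated.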
Why it matters?
This dataset matters because it lets researchers train more powerful image-editing models that can handle more challenging and realistic editing tasks. As a richer training resource, MultiEdit should help advance the field and enable models to understand and carry out more complex image-manipulation requests.
Abstract
Current instruction-based image editing (IBIE) methods struggle with challenging editing tasks, as both editing types and sample counts of existing datasets are limited. Moreover, traditional dataset construction often contains noisy image-caption pairs, which may introduce biases and limit model capabilities in complex editing scenarios. To address these limitations, we introduce MultiEdit, a comprehensive dataset featuring over 107K high-quality image editing samples. It encompasses 6 challenging editing tasks through a diverse collection of 18 non-style-transfer editing types and 38 style transfer operations, covering a spectrum from sophisticated style transfer to complex semantic operations like person reference editing and in-image text editing. We employ a novel dataset construction pipeline that utilizes two multi-modal large language models (MLLMs) to generate visual-adaptive editing instructions and produce high-fidelity edited images, respectively. Extensive experiments demonstrate that fine-tuning foundational open-source models with our MultiEdit-Train set substantially improves models' performance on sophisticated editing tasks in our proposed MultiEdit-Test benchmark, while effectively preserving their capabilities on the standard editing benchmark. We believe MultiEdit provides a valuable resource for advancing research into more diverse and challenging IBIE capabilities. Our dataset is available at https://huggingface.co/datasets/inclusionAI/MultiEdit.