UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang

2025-11-04

Summary

This paper introduces a new way to test how well AI models can edit images from complex instructions, going beyond simple changes like altering an object's color. It shows that current image editing models struggle with tasks that require deeper understanding and reasoning.

What's the problem?

Existing tests for image editing AI mostly focus on changing one object at a time in realistic pictures. They don't really challenge the AI to understand how multiple objects interact, or to follow human-defined rules like those in a game. Also, these tests rely only on text descriptions to judge whether the AI did a good job, which can be misleading for complicated tasks because text alone may not capture everything the edit requires.

What's the solution?

The researchers created a new benchmark called UniREditBench with 2,700 carefully designed image editing challenges. These challenges cover both real-world and game-like scenarios and test 8 different types of reasoning. Importantly, evaluation doesn't just use text to check the AI's work; it also compares the edited image to a ground-truth "correct" image. The researchers also built a large dataset, UniREdit-Data-100K, to help train AI models on these reasoning tasks, and used it to fine-tune an existing model called Bagel, producing UniREdit-Bagel.
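To make the dual-reference idea concrete, here is a minimal sketch of how a combined score could be computed. This is an illustration only, not the paper's actual evaluation: the real benchmark uses model-based judging, while here `image_similarity` is a toy pixel-level stand-in and `text_score` is assumed to come from some external textual judge.

```python
# Hypothetical sketch of multimodal dual-reference scoring.
# The paper's actual evaluator is model-based; these functions are stand-ins.

def image_similarity(edited, reference):
    """Toy pixel-level similarity between two equally sized 'images'
    (flat lists of values in [0, 255]). Returns a score in [0, 1]."""
    if len(edited) != len(reference):
        raise ValueError("images must have the same size")
    max_diff = 255 * len(edited)
    diff = sum(abs(a - b) for a, b in zip(edited, reference))
    return 1.0 - diff / max_diff

def dual_reference_score(edited, gt_image, text_score, w_image=0.5):
    """Combine a ground-truth image comparison with a textual-reference
    score (e.g. from a vision-language judge) into one overall score."""
    img_score = image_similarity(edited, gt_image)
    return w_image * img_score + (1 - w_image) * text_score

# Example: an edit close to the ground truth, judged consistent with the text.
edited = [10, 200, 30, 40]
gt     = [10, 198, 30, 42]
score = dual_reference_score(edited, gt, text_score=0.9)
```

The point of the second reference is visible here: a text-only judge could rate an edit highly even when the result drifts far from the intended outcome, whereas the ground-truth image comparison anchors the score to what the edit should actually look like.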

Why it matters?

This work is important because it provides a more thorough and reliable way to evaluate image editing AI. By identifying the strengths and weaknesses of different models, it helps researchers develop AI that can handle more complex and realistic image editing tasks, ultimately leading to more powerful and useful image manipulation tools.

Abstract

Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.