EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing
Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou
2025-09-19
Summary
This paper introduces a new way to automatically check how well image editing programs are doing, especially when you give them a series of instructions one after another.
What's the problem?
Currently, evaluating image editing is tough. One method needs perfect 'answer key' images, which are hard to create and can carry over biases from the models that generated them. Another uses AI to judge the edits, but that AI isn't always accurate at deciding whether the edits actually follow the instructions, look good, and stay consistent with the original image.
What's the solution?
The researchers created 'EdiVal-Agent,' a system that breaks down images into objects, then creates realistic editing instructions. It then uses AI, combined with tools that specifically identify objects in images, to check if the edits followed the instructions, if the changes make sense for the image's content, and if the final result looks visually appealing. They also built a large set of images and editing tasks, called 'EdiVal-Bench,' to test different editing programs.
Why it matters?
This work is important because it provides a more reliable and detailed way to test image-editing AI. By pinpointing exactly where these programs struggle, it helps developers build better editing tools, moving beyond just hoping they work toward actually understanding their weaknesses.
Abstract
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images -- resulting in limited coverage and inheriting biases from prior generative models -- or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline's modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.
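The abstract describes combining a VLM's soft judgment with an open-vocabulary object detector's hard check for instruction following, alongside separate consistency and quality scores. The minimal Python sketch below illustrates one plausible way such a hybrid could be wired together; all function names, score values, and the gating logic are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class EditEvaluation:
    """Three evaluation axes named in the abstract (values here are illustrative)."""
    instruction_following: float  # VLM judgment gated by an object detector
    content_consistency: float    # e.g., semantic-feature similarity to the source
    visual_quality: float         # e.g., a human-preference model score

def detector_confirms(detected_objects, expected_objects):
    # Hypothetical stand-in for an open-vocabulary detector check:
    # every object the instruction requires must appear in the edited image.
    return expected_objects <= detected_objects

def evaluate_edit(vlm_score, detected_objects, expected_objects,
                  consistency_score, quality_score):
    # Assumed combination rule: the detector's hard object check gates
    # the VLM's soft instruction-following score.
    following = vlm_score if detector_confirms(detected_objects, expected_objects) else 0.0
    return EditEvaluation(following, consistency_score, quality_score)

# Toy example: instruction "add a red hat"; the detector finds the hat,
# so the VLM's score passes through unchanged.
report = evaluate_edit(
    vlm_score=0.9,
    detected_objects={"person", "red hat"},
    expected_objects={"red hat"},
    consistency_score=0.85,
    quality_score=0.7,
)
print(report.instruction_following)  # 0.9
```

The gating design reflects the abstract's claim that VLM-plus-detector scoring agrees better with human judgments than a VLM alone: the detector vetoes edits where the required object is plainly missing, regardless of the VLM's score.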