SpotEdit: Evaluating Visually-Guided Image Editing Methods

Sara Ghazanfari, Wei-An Lin, Haitong Tian, Ersin Yumer

2025-08-26

Summary

This paper introduces SpotEdit, a new way to thoroughly test how well image editing tools work when given both a picture and text instructions.

What's the problem?

Current evaluations of these image editing tools aren't very rigorous: the test cases are too simple and don't reflect the challenges you'd face in real life. A big issue is that even advanced models like GPT-4o sometimes 'hallucinate' – meaning they assume something exists in the image that isn't actually there, and then perform an edit based on that false information.

What's the solution?

The researchers created SpotEdit, a detailed benchmark with various tests to evaluate different image editing models, including those using diffusion, autoregressive, and hybrid techniques. This benchmark specifically focuses on identifying when and how often these models hallucinate visual cues and make incorrect edits as a result. They've also made the code and benchmark available for others to use.
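The hallucination check described above boils down to a simple idea: a model should only perform the requested edit when the referenced visual cue is actually present in the image. The sketch below illustrates that scoring logic in minimal Python; all names and the data layout are hypothetical, not SpotEdit's real API.

```python
# Hypothetical sketch of hallucination scoring for visually-guided editing.
# A "hallucinated edit" is one the model performs even though the visual cue
# it was asked to edit does not exist in the input image. Names are
# illustrative only, not the benchmark's actual interface.

def hallucination_rate(cases):
    """Fraction of cue-absent cases where the model edited anyway.

    Each case is a dict with:
      cue_present  -- True if the referenced object exists in the image
      model_edited -- True if the model performed the requested edit
    """
    absent = [c for c in cases if not c["cue_present"]]
    if not absent:
        return 0.0
    hallucinated = sum(1 for c in absent if c["model_edited"])
    return hallucinated / len(absent)

cases = [
    {"cue_present": True,  "model_edited": True},   # correct edit
    {"cue_present": False, "model_edited": True},   # hallucinated edit
    {"cue_present": False, "model_edited": False},  # correctly abstained
]
print(hallucination_rate(cases))  # 0.5
```

In this toy example, one of the two cue-absent cases was edited anyway, giving a rate of 0.5; a reliable model would abstain on all cue-absent cases and score 0.0.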

Why it matters?

This work is important because it provides a much more realistic and comprehensive way to measure the performance of image editing tools. By highlighting the problem of hallucination, it helps developers improve these models and make them more reliable for real-world applications, ensuring they actually edit what you *intend* them to, and not something they imagine.

Abstract

Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.