GEditBench v2: A Human-Aligned Benchmark for General Image Editing
Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen
2026-03-31
Summary
This paper introduces a new way to test how well image editing programs work, focusing on whether the changes they make actually look right and stay faithful to the original image.
What's the problem?
Currently, testing image editing programs is difficult because the existing tests aren't very comprehensive and don't accurately measure how visually consistent the edited images are with the original. Visual consistency means things like making sure a person's identity doesn't change when you edit their picture, or that the structure of a scene remains logical after edits. Standard ways of measuring image quality just don't capture these nuances well.
What's the solution?
The researchers created GEditBench v2, a much larger and more diverse set of editing tasks: over 1,200 real-world requests covering 23 different things people might want to do to an image, including an open-set category for edits that don't fit any pre-defined task. They also developed PVC-Judge, an open-source model that compares pairs of edited images and judges which one stays more visually consistent with the original; it was trained on preference data generated by two new region-decoupled synthesis pipelines. They even built a test set, VCReward-Bench, made of expert-annotated preference pairs, to check that PVC-Judge agrees with human judgments of visual consistency.
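To make the pairwise-judging idea concrete, here is a minimal sketch of how a judge like PVC-Judge could be used to rank editing models on a single query. This is illustrative only: the judge callable, the win_rates helper, and the tie-handling convention are assumptions made for the sketch, not the paper's actual implementation.

    # Hypothetical sketch: aggregate pairwise consistency verdicts into
    # per-model win rates. judge(original, a, b) returns "A", "B", or "tie";
    # ties count as half a win for each side (a convention we assume here).
    from collections import defaultdict
    from itertools import combinations

    def win_rates(edits_by_model, original, judge):
        """edits_by_model maps a model name to its edited image.
        Returns each model's fraction of pairwise comparisons won."""
        wins = defaultdict(float)
        games = defaultdict(int)
        for (m_a, img_a), (m_b, img_b) in combinations(edits_by_model.items(), 2):
            verdict = judge(original, img_a, img_b)
            games[m_a] += 1
            games[m_b] += 1
            if verdict == "A":
                wins[m_a] += 1.0
            elif verdict == "B":
                wins[m_b] += 1.0
            else:  # tie
                wins[m_a] += 0.5
                wins[m_b] += 0.5
        return {m: wins[m] / max(games[m], 1) for m in edits_by_model}

Passing a trivial judge such as lambda original, a, b: "A" is an easy way to sanity-check the bookkeeping before plugging in a real consistency model.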
Why it matters?
This work is important because it provides a better way to evaluate image editing models. By revealing the weaknesses of current models with this new benchmark, it will help researchers build better, more reliable image editing tools that produce more realistic and visually pleasing results. It also shows that their new evaluation tool, PVC-Judge, even outperforms GPT-5.1, one of the most advanced existing systems, on average.
Abstract
Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models and providing a reliable foundation for advancing precise image editing.