Leveraging Verifier-Based Reinforcement Learning in Image Editing
Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang
2026-05-01
Summary
This paper focuses on improving how well AI models can edit images based on what people ask for, using a technique called Reinforcement Learning from Human Feedback (RLHF). It introduces Edit-R1, a new framework designed to better understand and evaluate image edits.
What's the problem?
Currently, image editing AI is hard to improve because the reward models used to judge edits aren't very detailed. They give a single overall score without checking whether the edit actually follows each part of the instructions. This leads to biased rewards and to edits that might look okay overall but miss important details, because the AI never learns *why* an edit is good or bad.
What's the solution?
The researchers created Edit-R1, which doesn't just score an edit, but *reasons* about it. Their reward model breaks the editing instruction into smaller requirements, checks whether the edited image satisfies each one, and combines those checks into a more useful and interpretable score (see the sketch below). They first teach the verifier with supervised examples of this reasoning, then refine it with a learning process called Group Contrastive Preference Optimization, which uses human pairwise preferences. This improved scoring system is then used to train the image editing AI itself.
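To make the decompose-check-aggregate idea concrete, here is a minimal Python sketch. The `vlm` client, its `decompose` and `judge` methods, and the fraction-of-satisfied-principles aggregation are illustrative assumptions, not the paper's actual Edit-RRM implementation, which relies on a chain-of-thought vision-language model.

```python
# Illustrative sketch of a verifier-style reward, assuming a hypothetical `vlm`
# client that can split an instruction into principles and judge each one.
from dataclasses import dataclass

@dataclass
class PrincipleCheck:
    principle: str   # one requirement extracted from the edit instruction
    satisfied: bool  # whether the edited image meets this requirement
    rationale: str   # the verifier's reasoning for this particular check

def reasoning_reward(instruction: str, source_img, edited_img, vlm) -> float:
    """Decompose the instruction, verify each principle, aggregate to a reward."""
    # Hypothetical call: returns a list of atomic requirements,
    # e.g. ["the cat is removed", "the background is unchanged"].
    principles = vlm.decompose(instruction)
    # Hypothetical call: judge(...) is assumed to return (satisfied, rationale).
    checks = [
        PrincipleCheck(p, *vlm.judge(p, source_img, edited_img))
        for p in principles
    ]
    # Simple aggregation used here for illustration: fraction of satisfied
    # principles, giving a fine-grained score in [0, 1].
    return sum(c.satisfied for c in checks) / max(len(checks), 1)
```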
Why it matters?
This work is important because it makes image editing AI more accurate and reliable. By providing a more detailed and thoughtful way to evaluate edits, the AI can learn to better follow instructions and create images that people actually want. The reward model also improves as it scales from 3B to 7B parameters, and it boosts existing editing models such as FLUX.1-kontext, showing it's a valuable tool for the field.
Abstract
While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
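As a rough illustration of the final training step described in the abstract, the sketch below shows how a non-differentiable scalar reward can drive GRPO-style training of an editing model: sample a group of candidate edits per instruction, score each with the reward model, and normalize the rewards within the group to obtain advantages. The names `edit_model.sample` and the generic `reward_fn` callable are assumptions for illustration; this is the standard group-relative advantage computation, not the paper's exact training recipe.

```python
# Sketch of GRPO-style advantage computation using a non-differentiable
# reward model, under the assumptions stated above.
import numpy as np

def grpo_advantages(instruction, source_img, edit_model, reward_fn,
                    group_size=8, eps=1e-6):
    """Score a group of sampled edits and normalize rewards within the group."""
    # Sample several candidate edits for the same instruction.
    edits = [edit_model.sample(instruction, source_img) for _ in range(group_size)]
    # Score each candidate with the reward model; only a scalar per sample
    # is needed, so the reward model never has to provide gradients.
    rewards = np.array([reward_fn(instruction, source_img, e) for e in edits])
    # Group-relative advantages: subtract the group mean and divide by the
    # group standard deviation, removing the need for a learned value function.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    return edits, advantages
```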