Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Li Yuan
2025-10-21
Summary
This paper introduces UniWorld-V2, a system built on a new way to improve how well models edit images based on text instructions.
What's the problem?
Current image editing models often memorize the specific examples they were trained on, making them bad at handling new or slightly different editing requests. They struggle to generalize beyond what they've already seen, and it's hard to judge how "good" an edited image is because there isn't a single standard for what a perfect edit looks like across different instructions.
What's the solution?
The researchers developed a framework called Edit-R1 that uses policy optimization to fine-tune image editing models *after* their initial training, which encourages the model to explore beyond its training data rather than memorize it. They also use a multimodal large language model (MLLM) to act as a judge, giving feedback on the edits without needing to be specifically trained for each type of edit. To make this judging more reliable, they add a filtering step that reduces noise in the MLLM's scores. The whole process is designed to work with many different starting image editing models, not just one specific one.
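To make the "MLLM as judge" idea concrete, here is a minimal sketch of one common way to turn a judge model's output logits into a fine-grained scalar reward: ask a yes/no question about the edit and take the softmax probability of the "yes" token. The function name and the two-token setup are illustrative assumptions, not the paper's exact implementation.

```python
import math

def logits_to_reward(logit_yes: float, logit_no: float) -> float:
    """Convert a judge MLLM's yes/no token logits into a reward in [0, 1].

    Hypothetical setup: the MLLM is asked "Does the edited image follow
    the instruction?" and the softmax probability assigned to the "yes"
    token is used as a fine-grained score, rather than a hard yes/no.
    """
    # Two-way softmax; subtract the max logit for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)
```

Because the reward is a probability rather than a binary label, two edits that both "pass" can still receive different scores, which gives the policy optimizer a smoother signal to learn from.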
Why it matters?
This work is important because it significantly improves the ability of computers to edit images accurately and flexibly based on instructions. The fact that it works well with different base models means it can be widely adopted and used to enhance many image editing applications, leading to more powerful and user-friendly tools.
Abstract
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
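The abstract's low-variance group filtering can be read as follows (one plausible interpretation, since the exact mechanism is not spelled out here): each edited image is scored several times by the MLLM judge, and edits whose repeated scores disagree too much are discarded as noisy before optimization. The function name, data layout, and threshold below are all illustrative assumptions.

```python
def filter_low_variance(score_samples: dict[str, list[float]],
                        max_std: float = 0.1) -> dict[str, float]:
    """Keep only edits whose repeated MLLM scores are consistent.

    score_samples maps an edit id to several independent judge scores.
    Edits whose score standard deviation exceeds max_std are dropped
    (their reward is too noisy); the rest keep their mean score.
    """
    kept = {}
    for edit_id, scores in score_samples.items():
        mean = sum(scores) / len(scores)
        std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
        if std <= max_std:
            kept[edit_id] = mean
    return kept
```

Filtering out inconsistent judgments before computing policy gradients is one standard way to keep a noisy learned (or prompted) reward model from destabilizing training.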