Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang

2026-03-13

Summary

This paper focuses on improving how well artificial intelligence creates and edits images based on text instructions, specifically by making the 'critic' part of the AI – the part that judges how good the image is – more reliable.

What's the problem?

Currently, when AI tries to generate or edit images using a process called reinforcement learning, the 'critic' often makes mistakes and gives inaccurate scores. It essentially 'hallucinates' and thinks an image is good when it isn't, or vice versa. This bad feedback throws off the whole process and leads to images that don't quite match what was asked for or aren't very realistic.

What's the solution?

The researchers developed a new framework called FIRM, which stands for Faithful Image Reward Modeling. They started by building two large, high-quality datasets for training the 'critic': one for image editing (FIRM-Edit-370K) and one for generating images from text (FIRM-Gen-293K). These datasets were curated with careful checks, judging edits on whether they were actually carried out and whether the rest of the image stays consistent, and judging generated images mainly on how well they follow the instructions. They then trained specialized 'critic' models, FIRM-Edit-8B and FIRM-Gen-8B, on these datasets. Finally, they designed a 'Base-and-Bonus' reward strategy that combines these different types of feedback to guide the AI, balancing competing goals like consistency and quality.
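The idea of a 'Base-and-Bonus' reward can be sketched in a few lines. The paper's summary does not give the exact formulas, so the multiplicative combination below, the function name `base_and_bonus`, and the example scores are all hypothetical illustrations of one plausible way a primary score could be modulated by a secondary one:

```python
# Hypothetical sketch of a "Base-and-Bonus" reward combination.
# Assumption: critic scores lie in [0, 1] and the bonus score
# scales the base score, so a weak secondary signal (e.g. poor
# consistency) dampens even a strong primary one.

def base_and_bonus(base: float, bonus: float, weight: float = 0.5) -> float:
    """Combine a primary (base) critic score with a secondary (bonus) score.

    A low bonus keeps the reward close to the base alone, while a high
    bonus amplifies it -- discouraging the policy from maximizing one
    objective (e.g. executing the edit) at the expense of the other
    (e.g. preserving the rest of the image).
    """
    return base * (1.0 + weight * bonus)

# Consistency-Modulated Execution (CME) for editing:
# execution is the base signal, consistency modulates it.
cme = base_and_bonus(base=0.9, bonus=0.2)  # strong edit, weak consistency

# Quality-Modulated Alignment (QMA) for generation:
# instruction alignment is the base, image quality modulates it.
qma = base_and_bonus(base=0.8, bonus=0.9)  # well-aligned, high quality
```

Under this (assumed) scheme, the well-balanced generation example earns a higher combined reward than the inconsistent edit, even though its base score is lower.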

Why it matters?

This work is important because it significantly improves the quality and accuracy of AI-generated and edited images. By making the 'critic' more reliable, the AI can create images that are more faithful to the original instructions and look more realistic, setting a new standard for image generation and editing technology.

Abstract

Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.