Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

2025-08-29

Summary

This paper focuses on improving how we train AI models to create images from text, specifically addressing issues with how the AI is 'rewarded' for good images and how we measure the quality of those images.

What's the problem?

Current AI systems that generate images from text rely on a scoring system to learn what makes a 'good' image. However, this scoring system can be tricked. Tiny differences in scores, once normalized within a group for comparison, can be blown far out of proportion, leading the AI to chase unimportant details and ultimately produce unstable or degraded images. This is called 'reward hacking'. Also, the standard benchmarks used to evaluate these image-generating AIs aren't detailed enough to really show how well they're doing.
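The normalization issue above can be seen in a minimal sketch (the scores and group size here are illustrative, not from the paper). GRPO-style training normalizes each image's score against its group's mean and standard deviation, so when all images in a group are nearly identical in quality, near-zero score gaps get stretched into large advantages:

```python
import numpy as np

# Hypothetical pointwise reward scores for a group of 4 images
# generated from the same prompt -- nearly identical quality.
scores = np.array([0.812, 0.810, 0.809, 0.811])

# GRPO-style group normalization: advantage = (score - mean) / std.
advantages = (scores - scores.mean()) / (scores.std() + 1e-8)

# A raw spread of only 0.003 is stretched into advantages of
# roughly +/-1.3, an "illusory advantage" the model over-optimizes.
print(advantages)
```

This is the failure mode the paper describes: the policy receives strong, confident-looking gradients from what is really reward-model noise.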

What's the solution?

The researchers developed a new training method called Pref-GRPO. Instead of trying to maximize a score, this method trains the AI to consistently *prefer* better images over worse ones. It does this by comparing pairs of images within each group using a preference reward model and using each image's win rate as its reward. They also created a new, much more thorough benchmark called UniGenBench, which uses 600 text prompts spanning 5 main themes and 20 subthemes, and evaluates semantic consistency along 10 primary and 27 sub-criteria, using a multimodal large language model (MLLM) to help construct the benchmark and run the assessment.
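The win-rate reward can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `prefer` judge here is a toy stand-in for the preference reward model, and the group size is arbitrary.

```python
import itertools

def win_rate_rewards(images, prefer):
    """Compare every pair of images in a group with a preference
    judge `prefer(a, b)` (True if a is preferred over b), and
    reward each image by its win rate against the other members."""
    n = len(images)
    wins = [0] * n
    for i, j in itertools.combinations(range(n), 2):
        if prefer(images[i], images[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (n - 1) for w in wins]

# Toy judge: numbers stand in for images, larger = "better".
rewards = win_rate_rewards([3, 1, 4, 2], prefer=lambda a, b: a > b)
print(rewards)
```

Because the reward is a rank-based fraction in [0, 1] rather than a raw score, group normalization can no longer inflate meaningless sub-point score gaps, which is the source of the stability the paper reports.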

Why it matters?

This work is important because it makes image generation AI more reliable and capable of creating higher-quality images. By fixing the reward system and providing a better way to test these models, we can push the boundaries of what AI can create and better understand the strengths and weaknesses of different image generation systems, both those publicly available and those kept private.

Abstract

Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging an MLLM for benchmark construction and evaluation. Our benchmark uncovers the strengths and weaknesses of both open and closed-source T2I models and validates the effectiveness of Pref-GRPO.