
G^2RPO: Granular GRPO for Precise Reward in Flow Models

Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai

2025-10-09

Summary

This paper focuses on improving how we teach AI image generators to create pictures that people actually like, by using a technique called reinforcement learning. It builds on existing methods that use randomness to explore different image possibilities, but aims to make the process more accurate and reliable.

What's the problem?

Current methods for aligning AI image generators with human preferences struggle because the feedback signal, the 'reward' that tells the AI what is good, is sparse and narrow. The AI explores many image variations by injecting randomness during generation, but it is hard to attribute credit to the specific random change that made one result better than another, and quality is judged at only a single, fixed level of detail. As a result, the generated images end up less aligned with human tastes than they could be.

What's the solution?

The researchers developed a new framework called G^2RPO. It works in two main ways. First, a Singular Stochastic Sampling strategy introduces random changes to the generation process one step at a time and keeps each change tightly linked to the reward signal, so the model can tell what effect each injected perturbation had. Second, a Multi-Granularity Advantage Integration module evaluates the image at several levels of detail, multiple 'diffusion scales', and combines those assessments instead of relying on a single fixed-granularity one, which gives a more robust and accurate evaluation of each sampling direction.
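To make the second idea concrete, here is a minimal sketch, with illustrative names that are not taken from the paper's code, of how advantages computed at several diffusion scales could be combined: each scale's rewards are normalized GRPO-style within the group of samples, and the per-scale advantages are then averaged.

```python
# Hypothetical sketch of multi-granularity advantage aggregation.
# Names and the averaging rule are assumptions, not the authors' implementation.
import numpy as np

def group_advantages(rewards):
    """GRPO-style normalization: (r - mean) / std within a group of samples."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def multi_granularity_advantage(rewards_per_scale):
    """rewards_per_scale maps a diffusion scale (e.g. a denoising granularity)
    to one reward per sample; advantages are computed per scale, then averaged."""
    per_scale = [group_advantages(r) for r in rewards_per_scale.values()]
    return np.mean(per_scale, axis=0)

# Example: four samples scored at three granularities.
rewards = {
    "coarse": [0.2, 0.5, 0.4, 0.9],
    "medium": [0.3, 0.6, 0.3, 0.8],
    "fine":   [0.1, 0.7, 0.5, 0.7],
}
print(multi_granularity_advantage(rewards))
```

Plain averaging is just one simple aggregation choice used here for illustration; the paper's Multi-Granularity Advantage Integration module may weight or combine the scales differently.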

Why it matters?

This research is important because it makes AI image generators better at creating images that people genuinely prefer. By providing a more precise and comprehensive way to evaluate image quality, it moves us closer to AI systems that can reliably generate content tailored to individual tastes, which has implications for art, design, and many other creative fields.

Abstract

The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G^2RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G^2RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
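For a more concrete picture, below is a minimal sketch, under assumptions, of one plausible reading of the Singular Stochastic Sampling idea in the abstract: SDE noise is injected at a single denoising step while the remaining steps stay deterministic, so the reward of the finished sample can be attributed to that one perturbation. `ode_step`, `sde_step`, and `reward_fn` are hypothetical callables, not the authors' API.

```python
# Sketch only: one stochastic (SDE) step per rollout, the rest deterministic (ODE),
# so the final reward correlates with the single injected noise. All callables
# are hypothetical placeholders.
import torch

def rollout_single_perturbation(x_T, timesteps, k, ode_step, sde_step, reward_fn):
    """Denoise from x_T over `timesteps`; only step index k uses stochastic sampling."""
    x = x_T
    injected_noise = None
    for i, t in enumerate(timesteps):
        if i == k:
            injected_noise = torch.randn_like(x)   # the one SDE perturbation
            x = sde_step(x, t, injected_noise)     # stochastic exploration
        else:
            x = ode_step(x, t)                     # deterministic denoising
    # The reward of the finished sample is tied to this single perturbation.
    return reward_fn(x), injected_noise
```

Sweeping k over the denoising steps, with a group of samples per step, would give the step-wise exploration with one attributable perturbation each that the abstract's "faithful reward for each SDE perturbation" suggests; the paper's actual procedure may differ in its details.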