GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
2025-10-29
Summary
This paper focuses on improving how artificial intelligence learns to create things, like images, using reinforcement learning applied to 'flow-matching models'. It specifically addresses a problem where the AI gets *too* good at maximizing a reward signal, and in doing so, the actual quality of what it creates gets worse.
What's the problem?
When AI systems are trained this way, they sometimes become overly confident in their decisions, leading to updates that are too large. A safeguard in the training process, called 'importance-ratio clipping', is supposed to prevent this, but it wasn't working correctly. The paper found that the importance ratios the AI computes were consistently skewed below their expected value, so overly confident positive updates never reached the clipped region and were never limited. As a result, the AI focused too much on the reward and ignored important details like image clarity or how well an image matches its text description.
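The failure mode described above can be seen with a small numeric sketch (names and the specific distribution are illustrative, not from the paper): in standard PPO-style clipping, a positive-advantage sample is only clipped when its importance ratio exceeds 1 + eps, so if the ratio distribution is shifted left of 1, the clip essentially never fires.

```python
import numpy as np

def ppo_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single sample."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

# Hypothetical left-shifted ratio distribution (mean below 1),
# mimicking the systematic shift the paper reports.
rng = np.random.default_rng(0)
ratios = rng.normal(loc=0.9, scale=0.05, size=10_000)

eps = 0.2
# For positive advantages, clipping only activates above 1 + eps.
frac_clipped = np.mean(ratios > 1 + eps)
print(frac_clipped)  # ~0: the clip never constrains these updates
```

With the ratios centered near 0.9, virtually no positive-advantage sample crosses the 1 + eps threshold, so the intended safety mechanism is inert.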
What's the solution?
The researchers developed a new method called 'GRPO-Guard' to fix this. It works in two main ways: first, it 'normalizes' the importance ratios to make them more balanced and consistent across different stages of the creation process. Second, it adjusts the strength of updates based on the amount of 'noise' present, preventing certain stages from dominating the learning process. Essentially, GRPO-Guard acts like a more reliable safety switch for the AI's updates, preventing it from over-optimizing and losing sight of the overall quality.
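The two ingredients described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `regulated_ratios` centers log-ratios per timestep (one plausible form of the paper's ratio normalization), and `reweighted_loss` applies hypothetical per-timestep weights to equalize gradient contributions across noise levels.

```python
import numpy as np

def regulated_ratios(log_ratios, timestep_ids, n_steps):
    """Sketch of ratio normalization: center log-ratios within each
    denoising timestep so the importance ratios become balanced and
    consistent across steps (geometric mean of 1 per timestep)."""
    out = np.empty_like(log_ratios)
    for t in range(n_steps):
        mask = timestep_ids == t
        if mask.any():
            out[mask] = log_ratios[mask] - log_ratios[mask].mean()
    return np.exp(out)

def reweighted_loss(ratios, advantages, timestep_ids, step_weights, eps=0.2):
    """PPO clipped loss with per-timestep gradient reweighting.
    `step_weights` is a hypothetical weight per timestep meant to keep
    any one noise region from dominating the policy update."""
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    per_sample = np.minimum(ratios * advantages, clipped * advantages)
    w = step_weights[timestep_ids]
    return -(w * per_sample).mean()
```

After normalization, ratios sit symmetrically around 1 within every timestep, so overconfident positive updates can actually enter the clipped region, which is the "regulated clipping" behavior the method targets.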
Why it matters?
This work is important because it makes these AI image and text generation systems much more practical. Without a fix for this over-optimization problem, the AI could create things that score highly based on the reward but are actually poor quality or don't meet the user's needs. GRPO-Guard allows the AI to learn more effectively and produce better results, making these technologies more useful in real-world applications.
Abstract
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.