GARDO: Reinforcing Diffusion Models without Reward Hacking

Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan

2026-01-06

Summary

This paper focuses on improving how we teach AI image generators to create pictures that accurately match text descriptions, using a technique called reinforcement learning.

What's the problem?

When training these AI image generators with reinforcement learning, it's hard to define exactly what 'good' looks like, so a simpler but imperfect 'proxy' score is used as a guide. The AI can 'game' this score, earning high marks without actually producing high-quality or diverse images, a failure known as reward hacking. Existing fixes, which keep the model close to a fixed reference version of itself, often slow down learning and limit the AI's ability to explore new, potentially better, image styles.

What's the solution?

The researchers developed a new method called GARDO. Instead of constantly holding the AI back, it applies its restrictions selectively, only to outputs the model is uncertain about. GARDO also periodically updates the reference 'guide' it compares against so the guide keeps pace with the AI's progress, and it boosts the reward for images that are both high-quality *and* different from each other, preventing the model from getting stuck making the same kinds of pictures over and over.
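To make the two regularization ideas concrete, here is a minimal sketch in PyTorch. It is not the paper's implementation: `policy.log_prob`, `uncertainty`, `tau`, `beta`, and `refresh_every` are assumed names and interfaces used only for illustration. The sketch shows a per-sample gate on the reference penalty and a periodic refresh of the reference model.

```python
import copy
import torch

def gated_regularized_loss(policy, ref_policy, samples, rewards,
                           uncertainty, tau=0.5, beta=0.1):
    """Hypothetical sketch: a reward-weighted objective plus a reference
    penalty that is switched on per sample only when that sample's
    uncertainty estimate exceeds the threshold tau."""
    logp = policy.log_prob(samples)          # log-prob under the online policy (assumed API)
    logp_ref = ref_policy.log_prob(samples)  # log-prob under the reference model (assumed API)
    gate = (uncertainty > tau).float()       # 1 only for high-uncertainty samples
    pg_term = -(rewards * logp).mean()       # simple policy-gradient surrogate
    reg_term = (gate * (logp - logp_ref)).mean()  # selective, gated regularization
    return pg_term + beta * reg_term

def maybe_refresh_reference(policy, ref_policy, step, refresh_every=1000):
    """Adaptive regularization: periodically sync the reference model with
    the online policy so the penalty targets an up-to-date model rather
    than the original, possibly sub-optimal one."""
    if step % refresh_every == 0:
        ref_policy.load_state_dict(copy.deepcopy(policy.state_dict()))
```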

Why it matters?

GARDO is important because it allows AI image generators to learn more efficiently, create better images that truly match the text, and explore a wider range of creative possibilities without being tricked by flawed scoring systems. This means we can get more realistic and diverse images from these AI tools.

Abstract

Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
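For the diversity-aware optimization described above, the sketch below shows one plausible way such reward amplification could look. This is an illustrative assumption, not GARDO's actual formula: `embeddings`, `quality_thresh`, and `alpha` are hypothetical names, and batch-level cosine similarity stands in for whatever diversity measure the paper uses.

```python
import torch

def diversity_amplified_rewards(rewards, embeddings, quality_thresh=0.7, alpha=0.5):
    """Hypothetical sketch: boost the reward of high-quality samples that are
    also far from the rest of the batch in an embedding space, encouraging
    mode coverage without rewarding low-quality outliers.
    Assumes a batch size of at least 2."""
    # Pairwise cosine similarity between generated samples (batch, dim).
    normed = torch.nn.functional.normalize(embeddings, dim=-1)
    sim = normed @ normed.T
    # Diversity = 1 - mean similarity to the other samples in the batch.
    n = sim.size(0)
    mean_sim = (sim.sum(dim=1) - 1.0) / (n - 1)  # subtract self-similarity of 1
    diversity = 1.0 - mean_sim
    # Amplify only samples that already clear a quality bar on the proxy reward.
    high_quality = (rewards > quality_thresh).float()
    return rewards * (1.0 + alpha * diversity * high_quality)
```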