
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou

2025-11-25


Summary

This paper focuses on improving how computers learn to generate realistic and appealing images using a technique called reinforcement learning. The core idea is to create a better way to 'reward' the computer when it creates a good image, so it learns to make even better ones.

What's the problem?

Currently, many image generation systems rely on 'reward functions' that try to guess what humans would like. These functions are often built using other AI models trained on human preferences, but they aren't very reliable. They can be tricked into thinking a bad image is good (called 'reward hacking') and don't always align with what people actually find visually pleasing. Existing methods to fix this, like tweaking the reward function, still have underlying biases that can hurt image quality or artistic style.

What's the solution?

The researchers developed a new framework called Adv-GRPO. It uses an 'adversarial' approach in which the image generator and the reward function are improved together, in alternation. The reward model is trained with high-quality reference images as positive examples, which makes it much harder to trick. And instead of handing back a single score, it treats the image itself as the reward: powerful vision foundation models (like DINO) compare the generated image against references and provide rich, detailed feedback. This denser signal helps the generator improve image quality and aesthetics, and it also makes it possible to customize the style of the generated images.
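To make the "image as reward" idea concrete, here is a minimal sketch (not the authors' released code) of a dense visual reward built from DINO features: it scores a generated image by how similar its DINO embedding is to that of a reference image. The specific checkpoint, preprocessing, and reward shaping are assumptions for illustration.

```python
# Sketch only: DINO-feature similarity as a visual reward signal.
# Assumptions: the public facebookresearch/dino torch.hub ViT-S/16 checkpoint,
# standard ImageNet preprocessing, cosine similarity as the reward.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Load a self-supervised DINO ViT backbone from torch.hub.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dino_reward(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between DINO embeddings of a generated and a reference image."""
    batch = torch.stack([preprocess(generated), preprocess(reference)])
    feats = F.normalize(dino(batch), dim=-1)   # (2, feature_dim) global embeddings
    return feats[0].dot(feats[1]).item()       # in [-1, 1]; higher = closer to the reference
```

A reward like this depends on the whole visual content of the image rather than a single learned preference score, which is what makes it harder to "hack" with images that game a scalar metric.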

Why it matters?

This work is important because it addresses a fundamental challenge in AI image generation: how to accurately tell a computer what makes a good image. By creating a more robust and reliable reward system, the researchers have shown they can generate images that people consistently prefer over those created by existing methods, opening the door to more realistic and artistically satisfying AI-generated content.

Abstract

A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
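The abstract's adversarial loop can be summarized with a short, assumption-laden skeleton (not the released Adv-GRPO implementation): a small reward head over frozen DINO features is trained to score reference images high and current generator samples low, and its scores are then group-normalized, GRPO-style, to produce advantages for the generator update. All names and network sizes here are hypothetical.

```python
# Sketch only: alternating adversarial reward updates and group-relative rewards.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scores precomputed DINO features; higher = closer to the reference distribution."""
    def __init__(self, feat_dim: int = 384):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats).squeeze(-1)

def reward_model_step(head, opt, ref_feats, gen_feats):
    """One adversarial update: reference images are positives, generator samples are negatives."""
    logits = torch.cat([head(ref_feats), head(gen_feats)])
    labels = torch.cat([torch.ones(len(ref_feats)), torch.zeros(len(gen_feats))])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def group_relative_rewards(head, gen_feats_group):
    """GRPO-style advantages: normalize the learned reward within a group of samples per prompt."""
    with torch.no_grad():
        r = head(gen_feats_group)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In an actual training loop these two pieces would alternate: sample a group of images per prompt, update the reward head against reference images, then use the group-normalized rewards to update the generator.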