RewardDance: Reward Scaling in Visual Generation
Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang
2025-09-11
Summary
This paper introduces a new way to train reward models for AI systems that create images and videos, making them better at following instructions and producing high-quality results.
What's the problem?
Currently, training these reward models is difficult because existing methods don't fit the way image and video generation models are built. They either restrict what the model can take as input or rely on training objectives that don't match how the underlying models actually make predictions, so the scores don't accurately reflect what makes a good image or video. A bigger issue is 'reward hacking': the AI finds loopholes to get a high score without actually improving the quality of its creations.
What's the solution?
The researchers developed a framework called RewardDance. It works by having the AI predict whether a generated image or video is better than another, essentially giving a 'yes' or 'no' answer. This approach aligns better with how these generation models work and allows for building much larger and more capable reward models. They also incorporated the ability to give specific instructions, show examples, and even use step-by-step reasoning to guide the reward model.
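The core idea, scoring a candidate by the model's probability of answering "yes" to "is this better than the reference?", can be sketched in a few lines. This is a toy illustration, not the paper's code: the `logits` dictionary stands in for a real VLM's next-token logits.

```python
import math

def yes_probability(logits: dict[str, float]) -> float:
    """Reward as the softmax probability of the 'yes' token.

    In RewardDance, the reward for a generated image is the VLM's
    probability of predicting "yes", meaning the candidate beats the
    reference image under the given criteria. `logits` is a toy
    stand-in for the model's next-token logits.
    """
    total = sum(math.exp(v) for v in logits.values())
    return math.exp(logits["yes"]) / total

# Hypothetical logits for the question "is the candidate better than the reference?"
reward = yes_probability({"yes": 2.0, "no": 0.5})  # a soft score in (0, 1)
```

Because the reward is a probability rather than a hard yes/no decision, it provides a smooth signal that reinforcement learning can optimize against.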
Why it matters?
RewardDance mitigates reward hacking and enables reward models that are much better at evaluating and improving the quality of generated images and videos. This matters because it leads to AI systems that create more diverse, realistic, and useful visual content, rather than simply 'gaming' the scoring system without real improvement.
Abstract
Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward-hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and their ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.
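The abstract contrasts the Bradley-Terry pairwise loss with the generative "yes"-token objective. A minimal sketch of the two training objectives, using hypothetical scalar scores and logits rather than the paper's actual training code, makes the difference concrete:

```python
import math

def bradley_terry_loss(score_win: float, score_lose: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(s_win - s_lose)),
    applied to scalar reward-head outputs. This is the objective the
    abstract describes as misaligned with next-token prediction."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_win - score_lose))))

def generative_reward_loss(yes_logit: float, no_logit: float, label_yes: bool) -> float:
    """Next-token cross-entropy over {"yes", "no"}: the generative
    objective that matches how a VLM already predicts tokens."""
    p_yes = math.exp(yes_logit) / (math.exp(yes_logit) + math.exp(no_logit))
    return -math.log(p_yes if label_yes else 1.0 - p_yes)
```

With two candidate tokens the cross-entropy reduces to the same functional form as the Bradley-Terry loss on the logit difference; the practical difference is that the generative objective reuses the VLM's own token head and decoding path instead of bolting on a separate scalar reward head.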