
Pre-Trained Policy Discriminators are General Reward Models

Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen

2025-07-08


Summary

This paper introduces POLAR, a new way to train reward models for reinforcement learning. Instead of assigning each response an absolute score, POLAR trains the model to discriminate between policies, ranking outputs by how closely they match a reference policy's behavior.

What's the problem?

The problem is that traditional reward models assign absolute scores learned from human preference data, which is expensive to collect, hard to scale, and often generalizes poorly across different tasks and settings.

What's the solution?

The researchers developed POLAR, which treats reward modeling as a policy discrimination task. The reward model is pre-trained on large synthetic datasets of trajectories from many different policies, learning to recognize and rank how far one policy's behavior deviates from another's, so a response can be rewarded by how close it is to a reference policy. This approach improves accuracy and generalization while reducing dependence on human-labeled data.
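
To make the policy-discrimination idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of a pairwise ranking objective in PyTorch: the reward model is trained to score a trajectory that matches a reference policy higher than one from a more distant policy. The model, tensors, and hyperparameters below are illustrative placeholders.

```python
# Hypothetical sketch of a policy-discrimination ranking loss, not POLAR's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a pooled trajectory embedding to a scalar reward."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, traj_emb: torch.Tensor) -> torch.Tensor:
        return self.scorer(traj_emb).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Toy batch: embeddings of trajectories close to the reference policy ("chosen")
# and trajectories from a more distant policy ("rejected"), for the same prompts.
chosen = torch.randn(8, 64)
rejected = torch.randn(8, 64)

# Bradley-Terry style ranking loss: push the reference-like trajectory's score
# above the distant policy's score, so reward reflects relative policy closeness.
r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

loss.backward()
optimizer.step()
```

Because the signal is relative (which trajectory is closer to the reference) rather than an absolute quality score, this kind of objective can be pre-trained at scale on synthetic policy comparisons before any human preference data is involved.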

Why it matters?

This matters because better reward models make reinforcement learning more effective and scalable, helping AI systems learn stronger and more reliable behaviors across a wide range of complex tasks.

Abstract

A novel reward modeling approach, Policy Discriminative Learning (POLAR), enhances reward model performance and generalization in reinforcement learning by focusing on relative policy differences.