Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li

2025-02-27


Summary

This paper proposes agentic reward modeling, a reward system that combines human preference rewards with verifiable correctness checks so that the rewards used to train and evaluate AI models are more reliable.

What's the problem?

Current AI reward systems mostly focus on what humans prefer, but they often ignore whether the AI's answers are factually correct or follow specific instructions. This can lead to unreliable or biased results, especially when the AI is used in important tasks.

What's the solution?

The researchers built a reward agent called RewardAgent that combines human preference rewards with two verifiable signals: factuality and instruction following. They evaluated it on existing reward model benchmarks and in inference-time best-of-n search on real-world downstream tasks, where it significantly outperformed vanilla reward models. They also used RewardAgent to construct preference pairs and trained an LLM on them with the DPO objective; the resulting model outperformed one trained with a conventional reward model on various NLP benchmarks.
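
As a rough illustration of this idea, and not the authors' implementation, the sketch below combines a preference score with two verifiable checks. The class `ToyRewardAgent`, the stub scoring functions, and the plain averaging rule are all assumptions made for this example; the summary does not say how RewardAgent aggregates its signals.

```python
from dataclasses import dataclass
from typing import Callable, List

# A scorer maps (instruction, response) to a score in [0, 1].
Scorer = Callable[[str, str], float]


@dataclass
class ToyRewardAgent:
    """Hypothetical reward agent: one preference model plus verifiable checks."""
    preference_model: Scorer
    verifiers: List[Scorer]

    def reward(self, instruction: str, response: str) -> float:
        # Assumption: average all signals with equal weight.
        scores = [self.preference_model(instruction, response)]
        scores += [verify(instruction, response) for verify in self.verifiers]
        return sum(scores) / len(scores)


def toy_preference(instruction: str, response: str) -> float:
    # Stand-in for a learned preference model: favors longer answers.
    return min(len(response.split()) / 10.0, 1.0)


def toy_factuality(instruction: str, response: str) -> float:
    # Stand-in for a factuality check against a single known false claim.
    return 0.0 if "Paris is in Germany" in response else 1.0


def toy_instruction_following(instruction: str, response: str) -> float:
    # Stand-in for a constraint check: "one sentence" means at most one period.
    if "one sentence" in instruction:
        return 1.0 if response.count(".") <= 1 else 0.0
    return 1.0


agent = ToyRewardAgent(
    preference_model=toy_preference,
    verifiers=[toy_factuality, toy_instruction_following],
)
instruction = "Answer in one sentence: where is Paris?"
good = "Paris is the capital of France."
bad = "Paris is in Germany. It is a large and very well known city in Europe."
good_reward = agent.reward(instruction, good)
bad_reward = agent.reward(instruction, bad)
```

In this toy setup the preference stub alone would favor the longer, incorrect answer, while the combined reward favors the short, correct one. That is the failure mode the verifiable signals are meant to address.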

Why does it matter?

This matters because it helps create AI systems that are not only aligned with what people want but also provide correct and trustworthy information. This approach could make AI more reliable for real-world applications, like answering questions or making decisions, while reducing errors and biases.

Abstract

Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our code is publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
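
The two uses of the reward mentioned in the abstract, best-of-n search and building preference pairs for DPO training, can be sketched as follows. This is a hedged illustration only: `best_of_n`, `make_preference_pair`, and `toy_reward` are hypothetical names, and taking the highest- and lowest-scored responses as chosen and rejected is one common convention, not necessarily the procedure used in the paper.

```python
from typing import Callable, Dict, List

# A reward function maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: List[str], reward_fn: RewardFn) -> str:
    """Return the candidate response with the highest reward."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def make_preference_pair(
    prompt: str, candidates: List[str], reward_fn: RewardFn
) -> Dict[str, str]:
    """Build one (chosen, rejected) pair from scored candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    # Sort ascending by reward; the sort is stable, so ties keep input order.
    ranked = sorted(candidates, key=lambda response: reward_fn(prompt, response))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}


def toy_reward(prompt: str, response: str) -> float:
    # Placeholder for a real reward system: rewards the correct answer "4".
    return 1.0 if "4" in response else 0.0


prompt = "What is 2 + 2?"
candidates = ["2 + 2 = 5", "2 + 2 = 4", "I do not know"]
selected = best_of_n(prompt, candidates, toy_reward)
pair = make_preference_pair(prompt, candidates, toy_reward)
```

The dictionary with `prompt`, `chosen`, and `rejected` keys mirrors a layout commonly used for preference datasets; any real reward function with the same signature could replace `toy_reward`.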
