Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao

2026-04-09

Summary

This paper focuses on how well AI systems, specifically the 'reward models' used to train large language models, can understand what *different* people want. It introduces a new way to test these reward models to see if they can handle personalized preferences.

What's the problem?

Currently, we're good at testing whether AI responses are generally good: are they correct, helpful, and relevant? However, it's much harder to test whether an AI can figure out what *you* specifically like when multiple answers are all generally good. Existing benchmarks don't really measure how well AI adapts to individual tastes and preferences, and it's unclear whether current reward models can do this well at all.

What's the solution?

The researchers created a new test called 'Personalized RewardBench'. They had people rate pairs of AI responses, but the key is that the preferred response wasn't necessarily 'better' overall; it was the one that better matched *that person's* specific preferences, as defined by a set of user-specific rules (rubrics). They made sure both responses were generally high quality, so the difference was truly about personal taste. They then tested existing AI reward models on this benchmark.
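At its core, this kind of benchmark scores a reward model by how often it rates the human-preferred ("chosen") response above the "rejected" one. Below is a minimal sketch of that pairwise-accuracy evaluation; `reward_model` here is a hypothetical stand-in (it just scores by length), since a real reward model is a trained neural scorer.

```python
def reward_model(prompt: str, response: str) -> float:
    # Toy stand-in for illustration only: a real RM is a trained model
    # that assigns a scalar reward to a (prompt, response) pair.
    return float(len(response))

def pairwise_accuracy(pairs):
    """pairs: list of (prompt, chosen, rejected) triples.
    A pair counts as correct when the reward model scores the
    human-preferred (chosen) response higher than the rejected one."""
    correct = sum(
        reward_model(prompt, chosen) > reward_model(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)

# Two made-up pairs: the toy RM gets the first right and the second wrong.
pairs = [
    ("Explain gravity", "Gravity is the mutual attraction between masses.", "Stuff falls."),
    ("Explain gravity", "Short answer.", "A longer but off-preference answer."),
]
print(pairwise_accuracy(pairs))  # -> 0.5
```

The paper's benchmark uses this same chosen-vs-rejected setup, except the pairs are built so that only rubric adherence (personal preference), not general quality, separates them.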

Why it matters?

This work is important because if AI is going to be truly helpful, it needs to understand what *you* want, not just what a general audience might want. The new test, Personalized RewardBench, is a much better way to measure this, and it actually predicts how well the AI will perform in real-world applications like choosing the best response or improving its answers through reinforcement learning. It shows that current AI models struggle with personalization, highlighting an area that needs improvement.

Abstract

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
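The Best-of-N (BoN) sampling mentioned above is one of the downstream uses a reward model benchmark should predict: generate N candidate responses and let the reward model pick the winner. A minimal sketch, with a hypothetical toy reward model standing in for a trained scorer:

```python
def best_of_n(prompt, candidates, reward_model):
    """Best-of-N sampling: return the candidate response the reward
    model scores highest. If the RM captures a user's preferences
    well, BoN surfaces the personalized answer."""
    return max(candidates, key=lambda r: reward_model(prompt, r))

# Toy reward model (assumption: a real RM is a trained neural scorer).
toy_rm = lambda prompt, response: float(len(response))

candidates = ["a", "abc", "ab"]
print(best_of_n("hi", candidates, toy_rm))  # -> "abc"
```

Because the final output is whatever the reward model prefers, a reward model that scores high on the benchmark but misreads individual preferences would pick the wrong candidate here, which is why correlation between benchmark accuracy and BoN performance matters.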