RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, Anastasiia Makarova, Jeremiah Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh
2024-09-25

Summary
This paper introduces Robust Reward Model (RRM) training, a method that improves how large language models (LLMs) learn from human feedback. It targets reward hacking, where a model exploits weaknesses in the reward signal, and helps trained models align more closely with human preferences.
What's the problem?
Traditional reward-model training relies on response pairs tied to specific prompts, which makes it easy to confuse what humans actually prefer with prompt-independent artifacts such as response length or formatting. A reward model that latches onto these artifacts misjudges human preferences, and a policy optimized against it can exploit those shortcuts, a failure known as reward hacking.
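To make the standard setup concrete, the sketch below shows the usual pairwise (Bradley-Terry) loss such reward models are trained with. `reward_model` is a hypothetical scorer mapping a (prompt, response) pair to a scalar reward; this illustrates the conventional recipe, not code from the paper.

```python
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, prompt, chosen, rejected):
    """Conventional Bradley-Terry pairwise loss: the model only ever compares
    responses under the same prompt, so any artifact that correlates with the
    'chosen' label (e.g. longer responses) can be rewarded in place of the
    actual prompt-driven preference."""
    r_chosen = reward_model(prompt, chosen)      # scalar reward for the preferred response
    r_rejected = reward_model(prompt, rejected)  # scalar reward for the dispreferred response
    # maximize the margin between the chosen and rejected rewards
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```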
What's the solution?
The researchers cast preference learning in a causal framework that separates prompt-driven preferences from prompt-independent artifacts, and they pair it with a data augmentation technique that filters those artifacts out of the training data (illustrated in the sketch below). In their experiments, this approach improved reward-model accuracy and yielded more reliable reward signals for downstream alignment.
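The paper specifies its own augmentation recipe; as a rough illustration of the underlying idea, the sketch below pairs a prompt's preferred response with a response sampled from an unrelated prompt, so that prompt-independent artifacts such as length cannot by themselves determine the label. The function name and dataset fields (`prompt`, `chosen`, `rejected`) are hypothetical.

```python
import random

def augment_with_cross_prompt_pairs(dataset, num_augmented=None):
    """Illustrative augmentation: treat a well-written response to a *different*
    prompt as the rejected side. Since an off-prompt response cannot be
    contextually preferable, the label must come from the prompt itself rather
    than from artifacts such as length or formatting."""
    augmented = []
    for _ in range(num_augmented or len(dataset)):
        ex, other = random.choice(dataset), random.choice(dataset)
        if other["prompt"] == ex["prompt"]:
            continue  # skip accidental same-prompt draws
        augmented.append({
            "prompt": ex["prompt"],
            "chosen": ex["chosen"],       # on-prompt response stays preferred
            "rejected": other["chosen"],  # off-prompt response, however fluent
        })
    return dataset + augmented
```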
Why it matters?
This research is important because it enhances the training process for AI systems, making them more robust against potential issues that can arise from imperfect reward systems. By improving how LLMs learn from human feedback, this work helps create AI that better understands and aligns with human values, which is crucial for developing trustworthy and effective intelligent systems.
Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. However, traditional RM training, which relies on response pairs tied to specific prompts, struggles to disentangle prompt-driven preferences from prompt-independent artifacts, such as response length and format. In this work, we expose a fundamental limitation of current RM training methods, where RMs fail to effectively distinguish between contextual signals and irrelevant artifacts when determining preferences. To address this, we introduce a causal framework that learns preferences independent of these artifacts and propose a novel data augmentation technique designed to eliminate them. Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, on RewardBench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in AlpacaEval-2 from 33.46% to 52.49%.
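For context on the policy-training step, DPO optimizes the policy directly from preference pairs against a frozen reference model; in this pipeline the RM or RRM presumably determines which response counts as preferred. Below is a minimal sketch of the standard DPO objective (Rafailov et al., 2023), assuming the per-response log-probabilities have already been computed; it is generic, not the authors' implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: push the policy's log-probability ratio for the
    preferred response above its ratio for the dispreferred one, relative to a
    frozen reference model, with beta controlling the strength of the
    regularization toward the reference."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```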