Reward-Robust RLHF in LLMs
Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen
2024-09-25

Summary
This paper introduces reward-robust RLHF, a framework that improves how large language models (LLMs) learn from human feedback. It addresses the instability and imperfection of reward models, which can lead to problems such as reward hacking and misalignment with human goals.
What's the problem?
As LLMs become more advanced, they need to learn effectively from human feedback so that their behavior aligns with human intentions. However, the reward models that existing methods rely on are inherently imperfect and unstable, so the policy may optimize for unintended outcomes (reward hacking) or drift away from what humans actually want (misalignment). This makes it challenging to build reliable AI systems.
What's the solution?
To tackle these challenges, the researchers introduce a reward-robust RLHF framework. It uses a new optimization objective that balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty in how rewards are assigned. By combining the nominal reward signal with the worst-case (minimum) signal from the ensemble, the policy learns more stably even when individual reward models are imperfect. In experiments, the approach consistently outperformed traditional RLHF across diverse benchmarks, showing improved accuracy and long-term stability.
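This summary does not include the authors' code; as a rough illustrative sketch only, the snippet below shows one way an ensemble of reward-model scores could be collapsed into a single reward-robust training signal: a convex combination of the nominal (mean) ensemble reward and the worst-case (minimum) reward across ensemble members. The function name robust_reward, the trade-off weight lam, and the ensemble size are our own illustrative assumptions, not the paper's implementation.

```python
import torch

def robust_reward(ensemble_rewards: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Combine per-model reward scores into a reward-robust training signal.

    ensemble_rewards: tensor of shape (num_models, batch_size), one scalar
        reward per ensemble member for each (prompt, response) pair.
    lam: trade-off between nominal performance and worst-case robustness
        (illustrative; the paper defines its own weighting scheme).
    """
    nominal = ensemble_rewards.mean(dim=0)            # mean score across the ensemble
    worst_case = ensemble_rewards.min(dim=0).values   # minimum score across the ensemble
    # Convex combination: lean on the nominal signal for performance,
    # and on the worst-case signal for robustness to imperfect reward models.
    return lam * nominal + (1.0 - lam) * worst_case


# Example: 4 reward models scoring a batch of 3 responses.
scores = torch.tensor([[0.8, 0.2, 0.5],
                       [0.7, 0.3, 0.6],
                       [0.9, 0.1, 0.4],
                       [0.6, 0.4, 0.5]])
print(robust_reward(scores, lam=0.5))  # tensor([0.6750, 0.1750, 0.4500])
```

Setting lam close to 1 recovers ordinary single-signal reward shaping, while lowering it makes training more conservative whenever the reward models disagree.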
Why it matters?
This research matters because it makes LLMs more reliable in real-world applications by making them robust to imperfections in the reward models that guide their training. By helping these models align more closely with human intentions, the reward-robust RLHF framework paves the way for more trustworthy AI systems, which is crucial as we move closer to Artificial General Intelligence (AGI).
Abstract
As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI). However, the reliance on reward-model-based (RM-based) alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect reward models. Empirical results demonstrate that our framework consistently outperforms traditional RLHF across diverse benchmarks, showing improved accuracy and long-term stability. We also provide a theoretical analysis, demonstrating that reward-robust RLHF approaches the stability of constant reward settings, which proves to be effective in a stochastic-case analysis. Together, these contributions highlight the framework's potential to enhance both the performance and stability of LLM alignment with RLHF.
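Written out, a performance-robustness trade-off of this kind can be sketched as follows (our paraphrase of the objective described above, not necessarily the paper's exact formulation). With policy $\pi_\theta$, prompt distribution $\mathcal{D}$, nominal BRME reward $\hat{r}$, uncertainty set of reward functions $\mathcal{R}$, and trade-off weight $\lambda$:

$$
\max_{\theta}\;\; \lambda\,\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\!\big[\hat{r}(x,y)\big]
\;+\;(1-\lambda)\,\min_{r\in\mathcal{R}}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\!\big[r(x,y)\big]
$$

The first term tracks nominal performance under the ensemble's reward estimate; the second hedges against the worst reward function in the uncertainty set, which is what provides robustness to imperfect reward models.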