Taming Overconfidence in LLMs: Reward Calibration in RLHF
Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
2024-10-17

Summary
This paper presents two methods for reducing overconfidence in large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF), by calibrating the reward signals used during training.
What's the problem?
Large language models often express high certainty even when their answers are wrong. This overconfidence can mislead users and erode trust in the model's outputs. RLHF training tends to reinforce the issue: the reward models used during training are biased toward responses that express high confidence, regardless of whether those responses are actually correct.
What's the solution?
To address this problem, the authors propose two variants of the Proximal Policy Optimization (PPO) algorithm: PPO-M and PPO-C. PPO-M incorporates explicit confidence scores into reward model training, calibrating the reward model so that high verbalized confidence is rewarded only when the quality of the response supports it. PPO-C adjusts the reward score during PPO based on the difference between the current reward and a moving average of past rewards, so that confident responses are rewarded only when they actually score better than usual. Both methods can be integrated into existing PPO training pipelines without requiring additional labeled data.
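To make the PPO-C idea concrete, here is a minimal sketch of a reward adjustment that tracks a moving average of past rewards and scales the deviation from that average by the model's verbalized confidence. The class name, the window size, the weighting factor alpha, and the exact way confidence enters the update are assumptions for illustration, not the authors' exact rule.

```python
from collections import deque

class CalibratedRewardAdjuster:
    """Illustrative PPO-C-style reward adjustment (a sketch, not the paper's exact formula).

    Keeps a moving average of recent reward-model scores and amplifies the
    current reward's deviation from that average in proportion to the model's
    verbalized confidence: confident above-average responses gain reward,
    confident below-average responses lose reward.
    """

    def __init__(self, window: int = 512, alpha: float = 0.5):
        self.history = deque(maxlen=window)  # recent raw reward scores
        self.alpha = alpha                   # adjustment strength (assumed value)

    def adjust(self, reward: float, confidence: float) -> float:
        """reward: raw reward-model score; confidence: verbalized confidence in [0, 1]."""
        # Use the running mean of past rewards as a dynamic baseline.
        baseline = sum(self.history) / len(self.history) if self.history else reward
        self.history.append(reward)
        # Assumed rule: scale the gap to the baseline by confidence, so that
        # expressing confidence only pays off when the response beats the baseline.
        return reward + self.alpha * confidence * (reward - baseline)


# Example usage with made-up numbers
adjuster = CalibratedRewardAdjuster()
print(adjuster.adjust(reward=0.8, confidence=0.9))  # first call: baseline == reward, no change
print(adjuster.adjust(reward=0.2, confidence=0.9))  # confident but below average -> pushed down
```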
Why it matters?
This research is important because it helps improve the reliability of AI systems that use language models. By calibrating how confident these models are in their answers, developers can create more trustworthy AI tools that provide better information and enhance user experience. This is especially crucial in applications like customer service, education, and healthcare, where accurate information is vital.
Abstract
Language model calibration refers to the alignment between a model's confidence and the actual performance of its responses. Previous studies have pointed out the overconfidence phenomenon in Large Language Models (LLMs) and shown that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident, with sharpened output probabilities; in this study, we reveal that RLHF also tends to lead models to express verbalized overconfidence in their own responses. We investigate the underlying cause of this overconfidence and demonstrate that reward models used for Proximal Policy Optimization (PPO) exhibit an inherent bias towards high-confidence scores regardless of the actual quality of responses. Building upon this insight, we propose two PPO variants: PPO-M (PPO with Calibrated Reward Modeling) and PPO-C (PPO with Calibrated Reward Calculation). PPO-M integrates explicit confidence scores into reward model training, which calibrates reward models to better capture the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO based on the difference between the current reward and the moving average of past rewards. Both PPO-M and PPO-C can be seamlessly integrated into the current PPO pipeline and do not require additional golden labels. We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets, including multiple-choice and open-ended generation. Experimental results demonstrate that both methods reduce calibration error while maintaining performance comparable to standard PPO. We further show that they do not compromise model capabilities in open-ended conversation settings.
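For the PPO-M side, the sketch below illustrates one plausible way to build confidence-augmented preference pairs for reward model training: a good response stated with high confidence should outrank the same response stated with low confidence, and a poor response stated with low confidence should outrank the same poor response stated with high confidence. The template text, the confidence wording, and the pairing scheme are assumptions for illustration, not the authors' exact recipe.

```python
# Illustrative construction of confidence-augmented preference pairs, in the
# spirit of calibrated reward modeling (PPO-M). All templates and confidence
# strings below are assumed for illustration.

HIGH_CONF = "Confidence: 9/10"   # assumed verbalized-confidence suffix
LOW_CONF = "Confidence: 2/10"

def build_calibration_pairs(prompt: str, chosen: str, rejected: str):
    """Return (preferred, dispreferred) text pairs for reward-model training."""
    def render(response: str, conf: str) -> str:
        # Append a verbalized confidence statement to the response.
        return f"{prompt}\n{response}\n{conf}"

    return [
        # Good answer: high confidence preferred over low confidence.
        (render(chosen, HIGH_CONF), render(chosen, LOW_CONF)),
        # Bad answer: low confidence preferred over high confidence.
        (render(rejected, LOW_CONF), render(rejected, HIGH_CONF)),
    ]

# Example usage with a made-up question
pairs = build_calibration_pairs(
    prompt="What is the capital of Australia?",
    chosen="Canberra.",
    rejected="Sydney.",
)
for preferred, dispreferred in pairs:
    print("PREFERRED:\n", preferred, "\nDISPREFERRED:\n", dispreferred, "\n")
```

Training a reward model on pairs like these would push it to score verbalized confidence in proportion to response quality, rather than rewarding confident wording unconditionally.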