Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, Lin Yan

2025-03-31

Summary

This paper is about improving how AI learns from human feedback to better match what people want.

What's the problem?

AI trained on human feedback can find loopholes in the reward system (known as reward hacking), producing odd or repetitive answers, and over time its responses can become less varied and less attuned to what people actually want.

What's the solution?

The researchers built a hybrid feedback system that combines rule-based verifiers for tasks with checkable answers (like math and code) with a generative reward model for open-ended tasks, and they introduced a prompt-selection method, called Pre-PPO, that keeps the AI's responses varied while it learns.

Why it matters?

This work matters because it can help AI learn to be more helpful and aligned with human values, leading to better AI assistants and other applications.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
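The hybrid reward idea from the abstract can be sketched as a simple routing rule: if a prompt has a verifiable ground-truth answer (a reasoning task), score it with a rule-based verifier (RTV); otherwise, fall back to a generative reward model (GenRM). The sketch below is illustrative only; the function names, the exact-match verifier, and the toy GenRM heuristic are assumptions for demonstration, not the paper's implementation.

```python
from typing import Optional


def rtv_reward(response: str, ground_truth: str) -> float:
    """Reasoning task verifier (RTV): a rule-based check against a
    known answer. Here, a simple exact match after trimming whitespace."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


def genrm_reward(prompt: str, response: str) -> float:
    """Stand-in for a generative reward model (GenRM). A real GenRM
    would query an LLM judge; this toy heuristic just penalizes empty
    or highly repetitive responses."""
    words = response.split()
    if not words:
        return 0.0
    # Fraction of distinct words: lower for repetitive answers.
    return len(set(words)) / len(words)


def hybrid_reward(prompt: str, response: str,
                  ground_truth: Optional[str] = None) -> float:
    """Route to the RTV when a verifiable answer exists, else to the
    GenRM. RTV is the most resistant to reward hacking, so it is
    preferred whenever it applies."""
    if ground_truth is not None:
        return rtv_reward(response, ground_truth)
    return genrm_reward(prompt, response)
```

For example, a math prompt with a known answer is scored 1.0 or 0.0 by the verifier, while an open-ended prompt gets a graded score from the reward-model stand-in; the routing keeps the hack-resistant verifier in charge wherever possible.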