
On-Policy RL with Optimal Reward Baseline

Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei

2025-05-30


Summary

This paper introduces OPO (On-Policy RL with Optimal reward baseline), a new reinforcement learning method that makes training large language models more stable and effective, especially when teaching them to follow instructions or reason through problems.

What's the problem?

The problem is that reinforcement learning for language models is often unstable: training can collapse, the model's outputs can lose diversity, and results vary from run to run. This makes it hard to get models that consistently give good answers and follow directions well.

What's the solution?

The researchers introduced OPO, an algorithm built on two ideas. First, exact on-policy training: the model learns only from responses generated by its current version, rather than reusing stale samples from earlier versions. Second, an optimal reward baseline: a carefully chosen reference value is subtracted from each reward, reducing noise in the learning signal. Together these make the training process smoother and help the model align better with what people want.
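To make the baseline idea concrete, here is a minimal sketch of how an advantage (reward minus baseline) might be computed for a group of responses sampled for one prompt. It assumes the baseline is approximated as a length-weighted average of the group's rewards; the function names are illustrative, not from the paper's code.

```python
# Illustrative sketch: baseline-subtracted advantages for a group of
# sampled responses. Assumes a length-weighted mean-reward baseline;
# names and structure are ours, not the authors' implementation.

def length_weighted_baseline(rewards, lengths):
    # Baseline b = sum(l_i * R_i) / sum(l_i): each response's reward is
    # weighted by its token length before averaging.
    total = sum(lengths)
    return sum(l * r for l, r in zip(lengths, rewards)) / total

def advantages(rewards, lengths):
    # Subtracting the baseline centers the learning signal, so updates
    # push toward responses that beat the group's weighted average.
    b = length_weighted_baseline(rewards, lengths)
    return [r - b for r in rewards]

# Example: three sampled responses to one prompt.
print(advantages([1.0, 0.0, 0.5], [10, 20, 10]))  # → [0.625, -0.375, 0.125]
```

In this example the longer response with reward 0.0 pulls the baseline down less per token than a plain average would, and responses scoring above the weighted average get positive advantages.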

Why it matters?

This is important because more stable training leads to language models that are more reliable and trustworthy, making them better at understanding what users need and providing helpful, accurate responses.

Abstract

A novel reinforcement learning algorithm, OPO, improves training stability and performance in large language model alignment and reasoning by emphasizing exact on-policy training and using an optimal reward baseline.