
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, Junyang Lin

2025-12-02


Summary

This paper investigates how to stably train large language models with reinforcement learning (RL), a type of machine learning where an 'agent' learns to make decisions by trial and error to maximize a reward.

What's the problem?

Training a large language model with reinforcement learning is often unstable. Standard policy gradient methods like REINFORCE optimize a token-level objective that stands in for the true sequence-level reward, but this stand-in can become inaccurate, especially when the policy that generated the training data drifts away from the policy being updated. That inaccuracy can derail learning, and it hasn't been clear *why* certain widely used training tricks help.
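To make the setup concrete, here is a minimal, hypothetical sketch of a token-level REINFORCE-style loss for a language model, where every token's log-probability is weighted by the sequence-level reward. It is a generic illustration of the kind of surrogate the paper analyzes, not the authors' implementation; the function name, tensor shapes, and the absence of any baseline are assumptions made for brevity.

```python
import torch

def reinforce_token_loss(logprobs, rewards, mask):
    """Generic token-level REINFORCE-style surrogate (illustrative sketch).

    logprobs: (batch, seq_len) log pi_theta(token_t | prefix) under the current policy
    rewards:  (batch,) scalar sequence-level reward for each sampled response
    mask:     (batch, seq_len) 1 for response tokens, 0 for prompt/padding
    """
    # Broadcast the sequence-level reward onto every token of that sequence,
    # so each token's log-probability is pushed up or down by the same signal.
    per_token = -logprobs * rewards.unsqueeze(1) * mask
    # Average over valid tokens; minimizing this ascends the expected reward.
    return per_token.sum() / mask.sum().clamp(min=1)
```

In practice the raw reward would usually be replaced by a baselined or normalized advantage; that is omitted here to keep the sketch short.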

What's the solution?

The researchers show that the accuracy of this stand-in objective depends on two key things: how much the probabilities computed when generating responses differ from those computed during training (the 'training-inference discrepancy'), and how outdated the policy that generated the training data is relative to the one being updated (the 'policy staleness'). Using a first-order approximation, they prove that the surrogate is trustworthy only when both of these are kept small. They then tested this idea with a 30-billion-parameter Mixture-of-Experts model, finding that for on-policy training, plain policy gradient with 'importance sampling' correction is the most stable, while combining 'clipping' with a technique called 'Routing Replay' becomes essential once off-policy updates are introduced to speed up convergence, because reusing older data increases policy staleness.
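For a rough picture of what importance sampling correction and clipping look like in this kind of token-level objective, here is a minimal PPO-style sketch. The function name, tensor shapes, and clip range are illustrative assumptions; Routing Replay (reusing the rollout's expert-routing decisions inside the MoE forward pass) acts inside the model rather than in the loss, so it is not shown.

```python
import torch

def clipped_is_token_loss(new_logprobs, old_logprobs, advantages, mask, clip_eps=0.2):
    """Token-level surrogate with importance sampling correction and clipping (sketch).

    new_logprobs: (batch, seq_len) log-probs under the policy being updated
    old_logprobs: (batch, seq_len) log-probs under the policy that generated the data
    advantages:   (batch,) sequence-level advantage (e.g., normalized reward)
    mask:         (batch, seq_len) 1 for response tokens, 0 otherwise
    """
    adv = advantages.unsqueeze(1)
    # The importance sampling ratio corrects for the gap between the
    # data-generating policy and the current policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * adv
    # Clipping bounds how far a single update can push the ratio, limiting
    # the damage when the ratio estimate becomes unreliable.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped) * mask
    return per_token.sum() / mask.sum().clamp(min=1)
```

Dropping the clipping term roughly recovers the plain importance-sampling-corrected policy gradient, which the paper reports as the most stable choice in the on-policy setting.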

Why it matters?

This work provides a theoretical understanding of why certain techniques stabilize reinforcement learning with large language models. It's important because it gives researchers a principled way to improve training, rather than just relying on trial and error. The findings and recommended training methods will help others build more reliable and powerful AI systems that can learn from experience.

Abstract

This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
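To connect the abstract's terms to symbols, the sequence-level objective and its REINFORCE gradient can be written in the standard form below; the paper's exact formulation and notation may differ, and the rollout-policy ratio is only an illustrative way to see where training-inference discrepancy and policy staleness enter.

```latex
% Standard sequence-level RL objective and REINFORCE gradient for an LLM policy;
% the paper's exact derivation and notation may differ.
\[
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\big[ R(x, y) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}\!\Big[ R(x, y) \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta\big(y_t \mid x, y_{<t}\big) \Big].
\]
% In practice responses come from a rollout policy \pi_{\mathrm{rollout}} (the
% inference engine, possibly running an older checkpoint), so each token is
% reweighted by an importance ratio:
\[
r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{rollout}}(y_t \mid x, y_{<t})},
\]
% which stays near 1, and the token-level surrogate stays faithful to J(theta),
% only while the training-inference discrepancy and policy staleness are small.
```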