Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou

2025-12-08

Summary

This paper focuses on improving how large language models are refined after their initial pretraining, a stage known as post-training, using reinforcement learning. The goal is to make these models both more capable and better aligned with what humans want them to do.

What's the problem?

Reinforcement learning often has the model learn from its own generated outputs, but this setup can be unstable. As the model is updated, its behavior can quickly drift too far from what it already knows works well, which shows up as fluctuating exploration (policy entropy) and unstable gradients during training. Existing methods, like PPO-Clip, try to prevent this by limiting how much the probability of each individual sampled action can change, but they don't account for how the model's overall distribution over actions, including actions that were never sampled, is shifting.

What's the solution?

The researchers introduce a new global measure of how much the model's exploration changes with each update, called the 'entropy ratio'. This ratio compares the entropy (diversity) of the model's action distribution after an update to its entropy before the update. They then propose 'Entropy Ratio Clipping' (ERC), which keeps this ratio within bidirectional bounds, so that no single update can drastically expand or collapse the range of behaviors the model considers. ERC is a drop-in addition, and the authors integrate it into two existing reinforcement learning algorithms, DAPO and GPPO.
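To make the idea concrete, here is a minimal NumPy sketch of computing an entropy ratio between two categorical policies and clipping it to a band. The band limits (0.9, 1.1), the function names, and the use of a simple `np.clip` are illustrative assumptions for this sketch, not the paper's actual hyperparameters or loss formulation.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a categorical distribution (in nats)."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + eps)))

def entropy_ratio_clip(curr_probs, prev_probs, low=0.9, high=1.1):
    """Compute H(current) / H(previous) and clip it to [low, high].

    The raw ratio measures how much the policy's exploration changed
    in one update; the clipped value is what a trainer would use as
    the constraint signal. Bounds here are illustrative, not the
    paper's settings.
    """
    raw = entropy(curr_probs) / entropy(prev_probs)
    return float(np.clip(raw, low, high)), raw

# Example: the update sharpened the policy (entropy dropped),
# so the raw ratio falls below the lower bound and gets clipped.
prev = np.full(4, 0.25)               # uniform: maximum entropy
curr = np.array([0.7, 0.1, 0.1, 0.1])  # peaked: lower entropy
clipped, raw = entropy_ratio_clip(curr, prev)
```

In this example the raw ratio is well below 1 because the updated policy is much more concentrated than the previous one, so the clip binds at the lower bound, signaling that exploration collapsed faster than the constraint allows.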

Why it matters?

This work is important because it makes reinforcement learning for large language models more stable and reliable. By controlling the overall change in the model's exploration, it allows for more consistent improvements in performance and helps ensure the model stays on track during training, ultimately leading to better and more predictable AI systems.

Abstract

Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-Clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.