The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding
2025-05-29

Summary
This paper examines how reinforcement learning, a technique for training large language models to reason and solve problems, can run into trouble because the models become overconfident and stop exploring new solutions.
What's the problem?
The problem is that during training, the model's 'entropy', a measure of how widely it explores different answers, collapses very quickly. The model then gets stuck picking the same kinds of answers and misses better ones, so its performance hits a ceiling and stops improving.
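For readers unfamiliar with the term, the minimal sketch below (not from the paper's code) shows how policy entropy is typically measured during RL training: it is the average uncertainty of the model's next-token distribution, so a collapsing value means the model is piling probability onto a few tokens and exploring less. The tensor shapes and toy logits are illustrative assumptions.

```python
# Minimal sketch: average per-token entropy of a language-model policy.
# Lower entropy = probability mass concentrated on a few tokens = less exploration.
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average entropy of the next-token distribution, given logits of shape
    (batch, seq_len, vocab_size)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)  # (batch, seq_len)
    return token_entropy.mean()

# Toy example: a sharply peaked policy has much lower entropy than a flat one.
flat_logits = torch.zeros(1, 4, 1000)      # near-uniform over a 1000-token vocab
peaked_logits = flat_logits.clone()
peaked_logits[..., 0] = 10.0               # almost all mass on a single token
print(mean_token_entropy(flat_logits))     # high entropy (about log 1000)
print(mean_token_entropy(peaked_logits))   # low entropy (collapsed policy)
```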
What's the solution?
To fix this, the researchers analyzed what drives entropy collapse and found that it is tied to how the model updates its choices: tokens the model is already confident about, and that also receive high rewards, get reinforced the hardest. They then proposed two simple methods, called Clip-Cov and KL-Cov, that restrict how much the model can update on those high-confidence tokens, which keeps entropy from dropping too quickly. This helps the model keep exploring alternative answers and leads to better results.
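The paper's training code is not reproduced here, but the core idea can be sketched under some assumptions: score each token with a simple proxy for the covariance between its log-probability and its advantage, then either stop gradients through the highest-scoring tokens (the Clip-Cov variant) or attach a KL penalty to them (the KL-Cov variant). All function names, thresholds, and tensor shapes below are illustrative assumptions rather than the authors' implementation.

```python
# Rough, illustrative sketch of the Clip-Cov / KL-Cov idea (assumed shapes and
# hyperparameters; not the authors' code). Tokens whose log-probability and
# advantage are both unusually high drive entropy down the fastest, so their
# updates are either masked out (Clip-Cov) or penalized (KL-Cov).
import torch

def covariance_scores(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Per-token product of centered log-prob and centered advantage,
    a simple proxy for their covariance across the batch."""
    lp_centered = log_probs - log_probs.mean()
    adv_centered = advantages - advantages.mean()
    return lp_centered * adv_centered

def clip_cov_loss(log_probs, old_log_probs, advantages, clip_frac=0.002):
    """Clip-Cov-style objective: exclude the small fraction of tokens with the
    highest covariance scores from the policy-gradient update."""
    cov = covariance_scores(log_probs, advantages)
    k = max(1, int(clip_frac * cov.numel()))
    threshold = cov.flatten().topk(k).values.min()
    mask = (cov < threshold).float()           # 1 = keep gradient, 0 = clip out
    ratio = (log_probs - old_log_probs).exp()
    pg_loss = -(ratio * advantages)
    # Masked (high-covariance) tokens contribute nothing to the gradient.
    return (pg_loss * mask).sum() / mask.sum().clamp(min=1.0)

def kl_cov_loss(log_probs, old_log_probs, advantages, kl_coef=1.0, top_frac=0.002):
    """KL-Cov-style objective: keep all policy-gradient terms but add a KL
    penalty toward the old policy on the highest-covariance tokens."""
    cov = covariance_scores(log_probs, advantages)
    k = max(1, int(top_frac * cov.numel()))
    threshold = cov.flatten().topk(k).values.min()
    penalized = (cov >= threshold).float()
    ratio = (log_probs - old_log_probs).exp()
    pg_loss = -(ratio * advantages)
    kl = log_probs - old_log_probs             # simple per-token KL proxy
    return (pg_loss + kl_coef * penalized * kl).mean()
```

In both variants only a very small fraction of tokens is treated specially; the values above (0.2% of tokens, KL coefficient of 1.0) are placeholders, not the paper's tuned settings.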
Why it matters?
This is important because it means we can train language models to be better at reasoning and problem-solving by making sure they keep exploring different options, which can lead to smarter and more reliable AI.
Abstract
This paper investigates entropy dynamics in reinforcement learning with large language models, with the goal of preventing policy entropy collapse and improving exploration.