EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control
Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, Saiyong Yang
2025-11-21
Summary
This paper addresses a challenge in training large language models: keeping them from collapsing into bad habits during the learning process, and instead encouraging them to keep exploring better possibilities.
What's the problem?
When you're teaching a language model using a method called reinforcement learning, it's important to balance trying out new things (exploration) with sticking to what it already knows works (exploitation). A key measure of exploration is 'entropy'. The problem is that during training, the model learns from both successes and failures, and these successes and failures pull the entropy in opposite directions, making it hard to keep it at a good level. If entropy gets too low, the model stops exploring and gets stuck; if it's too high, learning is inefficient.
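To make "entropy" concrete, here is a minimal sketch (not from the paper) of how the entropy of a model's next-token distribution is computed, and how it separates an exploratory model from one that has stopped exploring:

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.

    High entropy: probability is spread over many tokens (exploring).
    Low entropy: probability is concentrated on a few tokens (exploiting).
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A peaked (exploitative) vs. a uniform (exploratory) distribution:
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(policy_entropy(peaked) < policy_entropy(uniform))  # True
```

If entropy drifts toward zero during training, the model is collapsing onto a few outputs; if it stays near the maximum (the uniform case), learning makes little progress. The paper's goal is to hold it at a useful level in between.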
What's the solution?
The researchers developed a new technique called EntroPIC. Think of it like a volume knob for successes and failures. Using a classic control-theory tool, a proportional-integral (PI) controller, EntroPIC automatically adjusts how much weight is given to each type of feedback - successes and failures - to keep the entropy stable throughout training. By dynamically tuning how much each success or failure contributes to the overall learning signal, it ensures the model continues to explore effectively without getting stuck or wasting time.
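The idea above can be sketched with a toy PI controller. This is a hedged illustration of the general mechanism, not the paper's exact update rule: the class name, gains, and coefficient convention (`alpha_pos`, `alpha_neg`) are all assumptions for illustration, as is the convention that positive samples push entropy down while negative samples push it up.

```python
class EntropyPIController:
    """Toy PI controller sketch (not the paper's exact formulation).

    Tracks the gap between a target entropy and the measured policy
    entropy, and nudges the loss coefficients for positive and
    negative samples in opposite directions to close that gap.
    """

    def __init__(self, target_entropy, kp=0.1, ki=0.01):
        self.target = target_entropy
        self.kp, self.ki = kp, ki   # proportional / integral gains
        self.integral = 0.0         # accumulated entropy error

    def update(self, measured_entropy):
        error = self.target - measured_entropy  # > 0 means entropy too low
        self.integral += error
        delta = self.kp * error + self.ki * self.integral
        # Assumed convention: positive samples reduce entropy, negative
        # samples raise it, so weight shifts toward whichever side pushes
        # entropy back toward the target.
        alpha_pos = max(0.0, 1.0 - delta)
        alpha_neg = max(0.0, 1.0 + delta)
        return alpha_pos, alpha_neg

# When entropy has dipped below target, the controller downweights
# positive samples and upweights negative ones:
ctrl = EntropyPIController(target_entropy=1.0)
print(ctrl.update(0.5))  # (0.945, 1.055)
```

The proportional term reacts to the current entropy gap, while the integral term accumulates persistent drift, which is what lets the controller hold entropy steady over long training runs rather than merely reacting step by step.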
Why it matters?
This is important because stable training is crucial for building powerful language models. By keeping the model exploring, EntroPIC helps it find better solutions and ultimately perform better on tasks like writing, translation, and answering questions. It provides a way to reliably train these large models, leading to more capable and useful AI systems.
Abstract
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy Stabilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.