EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
2025-07-29
Summary
This paper introduces EDGE-GRPO, an algorithm that improves how large language models learn to give diverse and accurate answers by addressing a problem called advantage collapse.
What's the problem?
In GRPO-style training, the model samples a group of responses to the same question, and each response's advantage is computed relative to the group's average reward. When every response in a group earns the same reward (for example, all correct or all wrong), these relative advantages collapse to zero, so the model receives no useful feedback from that group and stops improving.
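The collapse described above can be seen directly in a minimal sketch of GRPO-style group-relative advantage normalization. The function name and the small epsilon safeguard are illustrative choices, not taken from the paper:

```python
# Sketch of GRPO-style group-relative advantages, illustrating
# advantage collapse when every sampled response gets the same reward.

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group's mean and std deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # eps avoids division by zero when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Diverse outcomes give a useful learning signal (roughly +1 / -1 here):
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# An all-correct (or all-wrong) group collapses to zero advantage,
# so no gradient signal survives:
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])  # all 0.0
```

With mixed rewards, correct answers receive positive advantages and wrong ones negative; with uniform rewards, every advantage is exactly zero.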
What's the solution?
EDGE-GRPO addresses this with two main ideas. Guided Error Correction keeps both right and wrong examples in each group of sampled responses, preserving reward variety within the group. Entropy-Driven Advantage uses the model's entropy, a measure of its uncertainty, to assign different-sized advantages depending on confidence and correctness. Together, these keep the training signal informative by encouraging a mix of diverse and correct answers.
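The two ideas above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the entropy-weighting rule (scaling each relative advantage by normalized entropy), and the correction strategy (swapping in a reference solution when the whole group is wrong) are all assumptions made for clarity.

```python
def entropy_weighted_advantages(rewards, entropies, eps=1e-8):
    """Scale each group-relative advantage by the response's normalized
    entropy, so answers given with different confidence levels receive
    different-sized updates. (Illustrative rule, not the paper's exact one.)"""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    base = [(r - mean) / (std + eps) for r in rewards]
    max_h = max(entropies) + eps
    return [a * (h / max_h) for a, h in zip(base, entropies)]

def guided_error_correction(group, reference_solution):
    """If every response in the group is wrong, replace one with a guided
    corrected solution so rewards (and hence advantages) stay diverse."""
    if all(resp["reward"] == 0.0 for resp in group):
        group[0] = {"text": reference_solution, "reward": 1.0}
    return group
```

In this sketch, a confidently wrong answer (high entropy weight, negative advantage) is penalized more strongly than one the model was merely unsure about, and an all-wrong group is repaired so its advantages do not collapse to zero.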
Why does it matter?
This matters because better training means the language models become smarter, can reason through complex questions more accurately, and avoid repeating the same mistakes. It makes AI more reliable and effective for tasks like math problem solving and multi-step reasoning.
Abstract
The EDGE-GRPO algorithm addresses the advantage collapse problem in Group Relative Policy Optimization by incorporating entropy-driven advantage and guided error correction, enhancing response diversity and training signal.