Geometric-Mean Policy Optimization
Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
2025-07-29
Summary
This paper introduces Geometric-Mean Policy Optimization (GMPO), a new method that makes reinforcement-learning training of large language models more stable by changing how token-level rewards are aggregated during policy updates.
What's the problem?
The problem is that previous methods like Group Relative Policy Optimization (GRPO) optimize the arithmetic mean of token-level rewards, which is sensitive to outliers: tokens with extreme importance-sampling ratios can trigger large, unstable policy updates. This makes the learning process unpredictable and can hurt the model's final performance.
What's the solution?
GMPO addresses this by maximizing the geometric mean of token-level rewards instead of the arithmetic mean. Because the geometric mean is far less sensitive to outliers, extreme token-level importance ratios have much less influence, which keeps policy updates smooth and stable. As a result, GMPO improves reasoning ability on mathematical and multimodal benchmarks and outperforms GRPO.
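To make the aggregation difference concrete, here is a minimal sketch (not the authors' implementation) comparing arithmetic and geometric averaging of per-token importance-sampling ratios. Clipping and other details of the real objectives are omitted, and the function and variable names are illustrative assumptions.

```python
# Hedged sketch: arithmetic vs. geometric aggregation of per-token
# importance-sampling ratios. Names are illustrative, not from the paper.
import math

def arithmetic_mean_surrogate(ratios, advantage):
    """GRPO-style aggregation: average of ratio-weighted advantage over tokens."""
    return advantage * sum(ratios) / len(ratios)

def geometric_mean_surrogate(ratios, advantage):
    """GMPO-style aggregation: geometric mean of per-token ratios,
    computed in log space for numerical stability."""
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return advantage * math.exp(log_mean)

# One sequence with a single outlier importance ratio (e.g. a rare token whose
# probability shifted sharply between the old and new policy).
ratios = [1.02, 0.98, 1.01, 8.0, 0.99]
advantage = 1.0

print(arithmetic_mean_surrogate(ratios, advantage))  # ~2.40, dominated by the outlier
print(geometric_mean_surrogate(ratios, advantage))   # ~1.52, outlier's effect is damped
```

The damping happens because the geometric mean averages ratios in log space, so one extreme ratio shifts the objective additively in the log rather than multiplicatively in the raw value.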
Why it matters?
This matters because stable training leads to more reliable and capable language models that solve challenging problems better, especially in areas like mathematics and multimodal reasoning, where models must combine different types of information.
Abstract
Geometric-Mean Policy Optimization (GMPO) stabilizes policy updates in large language models by maximizing the geometric mean of token-level rewards, outperforming Group Relative Policy Optimization (GRPO) on mathematical and multimodal reasoning benchmarks.
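For reference, a hedged sketch of the aggregation contrast described above, written in LaTeX. Here r_{i,t} denotes the per-token importance ratio and \hat{A}_i the group-relative advantage; the notation is assumed rather than copied from the paper, and the paper's full objective also includes clipping, which is omitted.

```latex
% Sketch of the aggregation contrast (clipping omitted); notation is an assumption.
% r_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})
\text{GRPO (arithmetic mean):}\quad
  \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} r_{i,t}\,\hat{A}_i
\qquad
\text{GMPO (geometric mean):}\quad
  \Bigl(\prod_{t=1}^{|o_i|} r_{i,t}\Bigr)^{1/|o_i|}\hat{A}_i
```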