Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
2025-07-25
Summary
This paper introduces Group Sequence Policy Optimization (GSPO), a new reinforcement learning algorithm designed to train large language models more efficiently and stably by operating on entire sequences instead of individual tokens.
What's the problem?
Previous reinforcement learning methods relied on token-level importance ratios, which introduced instability and made training large models, especially Mixture-of-Experts (MoE) models, very difficult, sometimes leading to model collapse.
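For context, here is a minimal PyTorch sketch of what token-level importance ratios look like in a PPO/GRPO-style update; the function and tensor names are illustrative, not taken from the paper:

```python
import torch

def token_level_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """PPO/GRPO-style per-token importance ratios.

    Each response token t gets its own ratio
        r_t = pi_theta(y_t | x, y_<t) / pi_old(y_t | x, y_<t),
    so high-variance estimates for single tokens flow directly into the
    gradient update, which is the instability this paper attributes to
    token-level weighting.
    """
    return torch.exp(logp_new - logp_old)  # shape: (batch, seq_len)
```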
What's the solution?
GSPO defines the importance ratio from the likelihood of an entire response and applies clipping, rewarding, and optimization at the sequence level. This stabilizes training, improves efficiency, and outperforms older token-level methods like GRPO, especially for large-scale models.
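Below is a minimal PyTorch sketch of how a sequence-level importance ratio with sequence-level clipping might be computed; the length normalization (a geometric mean over response tokens), the clipping range, and all names here are assumptions made for illustration, not the paper's exact implementation:

```python
import torch

def gspo_style_objective(
    logp_new: torch.Tensor,   # (batch, seq_len) token log-probs under current policy
    logp_old: torch.Tensor,   # (batch, seq_len) token log-probs under old policy
    mask: torch.Tensor,       # (batch, seq_len) 1 for response tokens, 0 for padding
    advantages: torch.Tensor, # (batch,) group-normalized sequence advantages
    clip_eps: float = 0.2,    # assumed clipping range, for illustration only
) -> torch.Tensor:
    # Sequence log-ratio: sum token log-prob differences over the response,
    # then length-normalize (geometric mean over tokens) so responses of
    # different lengths yield ratios on a comparable scale.
    lengths = mask.sum(dim=-1).clamp(min=1)
    seq_log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = torch.exp(seq_log_ratio)  # one importance ratio per response

    # Clip at the sequence level, mirroring PPO-style clipping but applied
    # to whole responses rather than to individual tokens.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    objective = torch.minimum(ratio * advantages, clipped * advantages)
    return -objective.mean()  # negate so minimizing this maximizes the objective
```

Because the ratio is a single scalar per response, clipping either keeps or excludes a whole response from the gradient, rather than clipping individual tokens within it.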
Why it matters?
This matters because GSPO makes training powerful language models more reliable and efficient, paving the way for more capable models such as the latest Qwen3 series, while simplifying the infrastructure needed for reinforcement learning.
Abstract
Group Sequence Policy Optimization (GSPO) is a reinforcement learning algorithm that improves the training efficiency and performance of large language models by computing importance ratios, clipping, rewarding, and optimizing at the sequence level rather than the token level.