Single-stream Policy Optimization
Zhongwen Xu, Zihan Ding
2025-09-17
Summary
This paper focuses on improving how we train Large Language Models (LLMs) to reason, using a reinforcement-learning technique called policy gradients. It proposes a new method that simplifies the training process and achieves higher accuracy on hard reasoning benchmarks.
What's the problem?
Current methods for training LLMs with policy gradients, such as GRPO, sample a group of responses for each prompt and use the group's average reward as a baseline to reduce variance in the learning signal. However, groups frequently become degenerate: when every response in a group earns the same reward (all correct or all incorrect), the advantages cancel out and the group contributes no learning signal at all. Group-based training also forces the system to wait at a synchronization barrier until every response in the group has finished generating, which slows training down, especially for complex, long-running tasks where generation times vary widely. The sketch below makes the degenerate-group failure concrete.
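The following is a minimal sketch (not the paper's code) of a GRPO-style, group-relative advantage: each response's reward minus the group mean, scaled by the group's standard deviation. The function name and reward values are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward minus the group mean, scaled by the
    group's standard deviation (illustrative implementation)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A mixed group still carries a learning signal:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1, -1,  1, -1]

# A degenerate group (all wrong here; all correct behaves the same) cancels
# out: every advantage is ~0, so the compute spent generating the whole
# group teaches the model nothing.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0, 0, 0, 0]
```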
What's the solution?
The researchers introduce a new approach called Single-stream Policy Optimization (SPO). Instead of grouping responses, SPO maintains a persistent, continuously updated estimate of how well the model does on each prompt and uses it as the baseline, adapting how quickly that estimate changes to how far the policy has drifted (measured via KL divergence). It also normalizes advantages across the whole batch at once rather than within small groups, providing a more stable, lower-variance learning signal. Because it doesn't rely on groups, SPO achieves higher throughput and copes naturally with tasks whose responses take very different amounts of time to generate. The sketch below illustrates these two ingredients.
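The exact update rules are in the paper; the sketch below only illustrates the two ingredients described above under assumed forms: a per-prompt value tracker whose update rate grows with a KL-based measure of policy drift, and advantage normalization over the full batch. The class name, the default value of 0.5, and the specific KL-adaptive step are illustrative assumptions, not SPO's actual equations.

```python
import numpy as np

class ValueTracker:
    """Persistent per-prompt baseline (illustrative; not the paper's exact rule)."""

    def __init__(self, beta=0.9, init_value=0.5):
        self.values = {}          # prompt_id -> running estimate of expected reward
        self.beta = beta          # base smoothing factor
        self.init_value = init_value

    def baseline(self, prompt_id):
        return self.values.get(prompt_id, self.init_value)

    def update(self, prompt_id, reward, kl_drift):
        # Assumed KL-adaptive step: the further the policy has drifted since
        # the last update (larger KL), the more weight the new reward gets.
        step = 1.0 - self.beta ** (1.0 + kl_drift)
        v = self.baseline(prompt_id)
        self.values[prompt_id] = v + step * (reward - v)

def global_normalized_advantages(rewards, baselines, eps=1e-6):
    """Advantage = reward - persistent baseline, standardized across the
    whole batch instead of within small per-prompt groups."""
    adv = np.asarray(rewards, dtype=float) - np.asarray(baselines, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)
```

Because each prompt contributes a single stream rather than a synchronized group, a batch is simply whatever rollouts have finished, which is what lets a group-free method keep throughput high when generation lengths vary.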
Why it matters?
This work is important because it shows that simplifying the training process, rather than adding more complex workarounds, can lead to significant improvements in LLM performance. The results demonstrate that SPO outperforms GRPO on challenging math benchmarks, achieving higher accuracy with smoother, more stable training. It suggests that returning to fundamental reinforcement-learning principles is key to advancing LLM reasoning capabilities.
Abstract
We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
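The abstract also notes that the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Below is a minimal sketch of one way this could work, assuming prompts whose tracked success rate is near 50% are sampled most often; the priority function is an assumption for illustration, not the paper's formulation.

```python
import numpy as np

def curriculum_probabilities(tracked_values, floor=1e-3):
    """Prioritized prompt sampling from tracked success estimates (illustrative).
    Prompts the policy sometimes solves (value near 0.5) are most informative,
    so they get the highest sampling priority; mastered or currently hopeless
    prompts are down-weighted but never starved entirely."""
    v = np.clip(np.asarray(tracked_values, dtype=float), 0.0, 1.0)
    priority = v * (1.0 - v) + floor   # peaks at v = 0.5
    return priority / priority.sum()

# Tracked values: nearly always wrong, borderline, nearly always right.
print(curriculum_probabilities([0.05, 0.5, 0.95]))  # the middle prompt dominates
```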