Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Ziyan Wang, Zheng Wang, Jie Fu, Xingwei Qu, Qi Cheng, Shengpu Tang, Minjia Zhang, Xiaoming Huo
2025-10-07
Summary
This paper focuses on improving how large language models, like those powering chatbots, learn to 'think' through complex problems using a technique called reinforcement learning.
What's the problem?
When training these models with reinforcement learning, a common method involves letting the model try things out and then giving it feedback. However, early in training the model's attempts (rollouts) are often low quality, so the feedback signal is noisy and the updates become unstable. It's like trying to learn to ride a bike when you keep falling over – it's hard to adjust when everything is wobbly and unpredictable.
What's the solution?
The researchers developed a new method called Slow-Fast Policy Optimization. It breaks each update into three stages: first, the model takes a few quick 'fast' steps on the same batch of attempts to probe a promising direction; then it repositions, pulling itself back toward where it started so it doesn't drift too far off track; finally, it makes one careful, slower corrective step. This stabilizes training without changing the learning objective or how the model generates its attempts.
Why it matters?
This new method is important because it makes training these reasoning models faster and more reliable. It needs fewer attempts and less training time to reach the same accuracy, and it scores higher on challenging reasoning tasks, like solving math problems. Ultimately, this means we can build smarter AI systems that are better at complex problem-solving.
Abstract
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each optimization step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also requires up to 4.93× fewer rollouts and up to 4.19× less wall-clock time to match GRPO's best accuracy.
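For a concrete picture of the reposition-before-update pattern described above, here is a minimal PyTorch-style sketch of how the three stages could fit together around a single rollout batch. It is an illustration based only on the abstract, not the authors' implementation: the function name `sfpo_step`, the hyperparameters (`fast_steps`, `fast_lr`, `reposition_alpha`, `slow_lr`), and the specific reposition rule (linearly interpolating the fast weights back toward the slow weights) are assumptions.

```python
# Sketch of a slow-fast update: fast inner steps on one batch, a reposition
# to limit off-policy drift, then a single slow correction. Illustrative only;
# the reposition rule and hyperparameter names are assumptions, not the paper's.
import torch


def sfpo_step(policy, batch, loss_fn, fast_steps=3, fast_lr=1e-6,
              reposition_alpha=0.5, slow_lr=1e-6):
    """One slow-fast update on a single rollout batch."""
    # Remember the "slow" weights before the fast trajectory begins.
    slow_params = [p.detach().clone() for p in policy.parameters()]

    # 1) Fast trajectory: a few quick inner steps on the same batch.
    fast_opt = torch.optim.SGD(policy.parameters(), lr=fast_lr)
    for _ in range(fast_steps):
        fast_opt.zero_grad()
        loss_fn(policy, batch).backward()
        fast_opt.step()

    # 2) Reposition: pull the fast weights back toward the slow weights
    #    so the final update is not taken from a point that has drifted
    #    too far off-policy (interpolation rule assumed here).
    with torch.no_grad():
        for p, p_slow in zip(policy.parameters(), slow_params):
            p.mul_(reposition_alpha).add_(p_slow, alpha=1.0 - reposition_alpha)

    # 3) Slow correction: one careful gradient step from the repositioned point.
    slow_opt = torch.optim.SGD(policy.parameters(), lr=slow_lr)
    slow_opt.zero_grad()
    loss_fn(policy, batch).backward()
    slow_opt.step()
```

Because all three stages reuse the same rollout batch and loss, a loop like this adds no extra generation cost per update, which is consistent with the abstract's claim that SFPO reduces the rollouts and wall-clock time needed to reach a given accuracy.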