
Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin

2025-11-26

Summary

This paper focuses on improving how we train large language models (LLMs) using a technique called reinforcement learning. Specifically, it addresses the challenges of making this training process stable and effective.

What's the problem?

When using reinforcement learning to improve LLMs, the updates to the model can be very unstable. This is because the importance weight assigned to each piece of text (token) during learning can vary wildly, leading to big swings in the model's behavior. The problem is even worse in more complex LLMs that use a 'mixture of experts' design. Previous methods tried to fix this by harshly clipping updates, but that often meant the model couldn't learn as well, because useful information was thrown away.

What's the solution?

The researchers developed a new method called Soft Adaptive Policy Optimization (SAPO). Instead of abruptly cutting off updates, SAPO uses a gentler approach: it smoothly reduces the impact of updates that are very different from the model's current behavior, while still letting the model learn from updates that are closer to it. Think of it like a volume knob instead of an on/off switch. This allows SAPO to be both consistent across entire sequences of text and adaptable to individual tokens within those sequences, preserving more of the learning signal.
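To make the "volume knob" idea concrete, here is a minimal sketch in PyTorch comparing hard clipping of a token's importance ratio with a smooth, temperature-controlled gate. The gate form (an exponential decay in the log-ratio), the function names, and the temperature value are illustrative assumptions, not the paper's exact formula.

```python
import torch

def hard_clipped_ratio(ratio: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # PPO/GRPO-style hard clipping: the "on/off switch".
    # Ratios outside [1 - eps, 1 + eps] are cut off abruptly.
    return torch.clamp(ratio, 1.0 - eps, 1.0 + eps)

def soft_gated_ratio(ratio: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    # Illustrative smooth gate (not SAPO's exact form): the further a
    # token's ratio drifts from 1 (fully on-policy), the more its update
    # is attenuated, but the attenuation is gradual, never a hard cutoff.
    deviation = ratio.log().abs()                # distance from on-policy (ratio == 1)
    gate = torch.exp(-deviation / temperature)   # smooth 1 -> 0 decay set by the temperature
    return ratio * gate

ratios = torch.tensor([0.95, 1.05, 3.0])         # last token is highly off-policy
print(hard_clipped_ratio(ratios))                # the outlier is clamped to 1.2
print(soft_gated_ratio(ratios))                  # near-on-policy tokens keep most of their
                                                 # weight; the outlier is strongly damped
```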

Why it matters?

SAPO is important because it provides a more reliable and efficient way to train LLMs with reinforcement learning. The experiments showed that SAPO leads to more stable training and better performance on challenging tasks, such as mathematical reasoning. It was also used to train the Qwen3-VL model series, showing that it scales well and works across different model sizes and tasks.

Abstract

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance (a phenomenon exacerbated in Mixture-of-Experts models), leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
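For readers who prefer code, the following is a hedged sketch of how such a soft gate might plug into a policy-gradient surrogate loss. The function name, tensor shapes, and the exponential gate are assumptions for illustration; the paper's actual objective (including how sequence-level coherence is enforced) may differ.

```python
import torch

def soft_gated_surrogate_loss(logp_new: torch.Tensor,
                              logp_old: torch.Tensor,
                              advantages: torch.Tensor,
                              temperature: float = 0.5) -> torch.Tensor:
    """Hypothetical soft-gated policy-gradient surrogate (not SAPO's exact objective).

    logp_new, logp_old: per-token log-probabilities under the current and
    behaviour policies, shape (batch, seq_len).
    advantages: per-sequence advantages, shape (batch, 1).
    """
    log_ratio = logp_new - logp_old                    # token-level log importance ratio
    ratio = log_ratio.exp()
    # Smooth, temperature-controlled gate: tokens with |log_ratio| near 0
    # pass through almost unchanged, while highly off-policy tokens are
    # attenuated gradually instead of being clipped away, and without
    # silencing the whole sequence they belong to.
    gate = torch.exp(-log_ratio.abs() / temperature).detach()
    per_token = gate * ratio * advantages
    return -per_token.mean()                           # minimize the negative gated surrogate
```

The contrast with a GSPO-style sequence-level clip is that, in this sketch, a few extreme ratios only shrink those tokens' own contributions; the remaining near-on-policy tokens in the same sequence still provide gradient signal.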