SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

2026-04-15

Summary

This paper introduces a new method, Sequence-Level PPO (SPPO), for improving how large language models (LLMs) learn to reason and solve problems when given rewards for correct answers.

What's the problem?

When you teach LLMs to think step-by-step (using a technique called Chain-of-Thought) and reward them only for getting the final answer right, it's hard to figure out *which* intermediate steps deserve credit. Traditional token-level methods like standard PPO make this credit assignment unstable over long reasoning chains, and they need a separate value model that consumes a lot of memory. Critic-free alternatives such as GRPO avoid these problems, but they are slow: they must generate many responses per prompt just to estimate a baseline for how good each answer is, which severely limits training throughput.
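To make the throughput cost concrete, here is a minimal sketch of the group-relative baseline used by critic-free methods like GRPO. This is illustrative, not the paper's implementation: the point is that every prompt needs a whole *group* of sampled responses before any advantage can be computed, so generation cost grows linearly with the group size.

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for G sampled responses to ONE prompt.

    rewards: list of scalar outcome rewards, one per sampled response.
    Each response's advantage is its reward normalized against the
    group's mean and standard deviation -- so G full generations are
    required per prompt before training can proceed.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Example: 4 samples for one math prompt, two correct (reward 1) and
# two incorrect (reward 0). Correct samples get positive advantage,
# incorrect ones negative -- but 4x the generation cost was paid.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Sampling the whole group is exactly the overhead SPPO is designed to avoid.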

What's the solution?

SPPO simplifies the process by treating the entire reasoning sequence as a single decision (a contextual bandit action), rather than a series of individual token choices. A lightweight, decoupled scalar value function estimates how good a response to each prompt should be, so no extra samples are needed to build a baseline. This makes training faster and more memory-efficient while keeping the updates stable and effective.
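The sequence-level idea can be sketched as follows. This is a hedged illustration of the formulation described above, not the authors' code: the function names and the scalar value estimate are assumptions for the example. One scalar advantage is computed per complete response from a single sample, then fed into a standard PPO-style clipped objective.

```python
def sequence_advantage(reward, value_estimate):
    """One scalar advantage for a whole response: A = r - V(prompt).

    reward: outcome reward for the full reasoning sequence (e.g. 1.0
            if the final answer verifies, else 0.0).
    value_estimate: a decoupled scalar value function's prediction of
            the expected reward for this prompt (illustrative stand-in).
    Only ONE sample per prompt is needed, unlike group-based baselines.
    """
    return reward - value_estimate


def clipped_sequence_objective(advantage, ratio, clip_eps=0.2):
    """PPO's clipped surrogate, applied once per sequence.

    ratio: new-policy / old-policy probability ratio of the sequence.
    Returns the pessimistic (clipped) objective value to maximize.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)
```

For example, a correct answer (reward 1.0) to a prompt the value function rated 0.4 yields a positive advantage of 0.6, pushing the policy toward that reasoning sequence; clipping then caps how far a single update can move the policy.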

Why it matters?

SPPO offers a way to train LLMs to reason more effectively without needing massive amounts of computing power. This is important because it makes it more practical to align these powerful models with human goals and values, especially for complex tasks like math problems.

Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.