
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu

2026-02-06

Summary

This paper investigates why different Reinforcement Learning with Verifiable Rewards (RLVR) algorithms cause the response lengths of Large Language Models (LLMs) and Vision-Language Models (VLMs) to change in different ways during training, and proposes a new algorithm, Length-Unbiased Sequence Policy Optimization (LUSPO), that removes the length bias responsible.

What's the problem?

When using RLVR to make LLMs and VLMs better at reasoning, researchers have observed that the length of the models' responses changes over the course of training. Longer responses are often associated with stronger reasoning, but the *way* response lengths evolve differs markedly depending on which RLVR algorithm is used. The core problem is that some algorithms unintentionally favor certain response lengths, which can push models toward responses that are either too short or unnecessarily long and hurt performance. In particular, Group Sequence Policy Optimization (GSPO) has a length bias built into its loss function that can cause response lengths to collapse during training.
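
To see where such a bias can enter, here is a simplified sketch of the GSPO objective as it is usually written (notation simplified; the exact definitions are in the GSPO and LUSPO papers). Each of the $G$ sampled responses $y_i$ is scored with a sequence-level importance ratio raised to the power $1/|y_i|$, i.e. normalized by its own length:

$$
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|},
\qquad
\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right]
$$

Because each response's advantage $\hat{A}_i$ is weighted through this length-normalized ratio, how much a response moves the policy is tied to its length $|y_i|$; this is the kind of length dependence the paper analyzes and that LUSPO is designed to remove.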

What's the solution?

The researchers analyzed the components of mainstream RLVR algorithms to understand why response lengths evolve differently under each one. They traced the bias in algorithms like GSPO to the loss function itself: the loss is not neutral with respect to response length, so how strongly a response influences training depends in part on how long it is. To fix this, they created a new algorithm called Length-Unbiased Sequence Policy Optimization (LUSPO), which rectifies the loss so that it is unbiased with respect to response length, preventing the length collapse seen with GSPO without pushing the model toward any particular response length. They evaluated LUSPO on mathematical reasoning benchmarks and multimodal reasoning tasks.
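
The paper defines LUSPO's exact correction; as a purely illustrative sketch (the function names and toy numbers here are assumptions, not the authors' code), the snippet below shows the mechanism being corrected: with a GSPO-style length-normalized sequence ratio, each token's influence on the update scales like 1/|y|, so the effective learning signal depends on response length.

```python
# Illustrative sketch only, not the paper's LUSPO objective: shows how a
# length-normalized sequence importance ratio couples per-token gradient
# weight to the response length |y|.
import numpy as np

def sequence_ratio(logp_new, logp_old, length_normalized=True):
    """Sequence-level importance ratio built from per-token log-probabilities."""
    log_ratio = float(np.sum(logp_new - logp_old))
    if length_normalized:
        log_ratio /= len(logp_new)  # GSPO-style 1/|y| exponent
    return float(np.exp(log_ratio))

rng = np.random.default_rng(0)
for n_tokens in (16, 64, 256):
    # Toy per-token log-probs for one response under the old and new policies.
    logp_old = rng.normal(-1.5, 0.3, n_tokens)
    logp_new = logp_old + rng.normal(0.0, 0.02, n_tokens)
    s = sequence_ratio(logp_new, logp_old, length_normalized=True)
    # For s = exp((1/|y|) * sum_t (logp_new_t - logp_old_t)), the gradient of s
    # with respect to any single token's log-prob is s / |y|, so each token's
    # influence shrinks as the response gets longer.
    per_token_weight = s / n_tokens
    print(f"|y|={n_tokens:4d}  sequence ratio={s:.3f}  per-token weight={per_token_weight:.4f}")
```

LUSPO's stated goal is a loss whose per-response weighting does not depend on |y| in this way, which is what "unbiased with respect to response length" means in the paper's abstract below.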

Why it matters?

This work is important because it gives a principled explanation of why response lengths change the way they do during RLVR training of LLMs and VLMs. Building on that analysis, the authors develop LUSPO, an optimization strategy that consistently outperforms existing methods such as GRPO and GSPO in their experiments. The result is a more reliable way to train models for complex reasoning tasks, without the unintended side effects on response length that current algorithms can introduce.

Abstract

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.