LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning

Weizhe Chen, Sven Koenig, Bistra Dilkina

2025-10-06

Summary

This paper explores a new way to improve how large language models, like those powering chatbots, learn to solve reasoning problems. It builds on a technique called reinforcement learning with verifiable rewards, which essentially trains the model by giving it positive feedback when it gets the right answer.
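To make "verifiable rewards" concrete, here is a minimal Python sketch of a binary reward check, assuming a hypothetical convention where the model ends its response with "Answer: <value>"; real RLVR pipelines use task-specific parsers and checkers.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known-correct
    answer, else 0.0. The binary, rule-based check is what makes the
    reward 'verifiable': no learned judge is needed."""
    # Illustrative convention: the response ends with "Answer: <value>".
    match = re.search(r"Answer:\s*(.+)", response)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == ground_truth.strip() else 0.0
```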

What's the problem?

Large language models often struggle on reasoning tasks because they 'overthink': they generate unnecessarily long chains of reasoning that waste compute and frequently still arrive at the wrong answer. Existing reinforcement learning methods for improving these models have mostly focused on tweaking the loss function and do not use response length to decide what the model should train on, which can make training inefficient. The core problem is choosing, at each step, the training examples that help the model learn without getting bogged down in unnecessary detail.

What's the solution?

The researchers developed a new algorithm called Length-aware Sampling for Policy Optimization, or LSPO. At each training step, LSPO chooses which prompts the model trains on based on the average length of the responses the model currently generates for them. If the model tends to produce long, incorrect answers, LSPO steers training toward examples it can handle more concisely. This dynamic selection helps the model learn more effectively by focusing on examples where successful, appropriately sized reasoning is within reach. A rough sketch of the idea follows.
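Below is a minimal Python sketch of what length-aware dynamic sampling could look like. The callback `sample_responses`, the `keep_fraction` parameter, and the rule of keeping the prompts with the shortest average responses are illustrative assumptions based on the summary above, not the paper's exact selection criterion.

```python
from statistics import mean
from typing import Callable, List, Sequence

def length_aware_select(
    prompts: Sequence[str],
    sample_responses: Callable[[str], List[str]],
    keep_fraction: float = 0.5,
) -> List[str]:
    """Illustrative length-aware dynamic sampling: at each training step,
    sample a few rollouts per prompt from the current policy, compute the
    average response length, and keep the prompts with the shortest
    averages, i.e. those the model can currently answer without
    'overthinking'."""
    avg_lengths = []
    for prompt in prompts:
        responses = sample_responses(prompt)  # rollouts from the current policy
        avg_lengths.append(mean(len(r.split()) for r in responses))
    # Rank prompts by average response length and keep the shortest ones.
    ranked = sorted(range(len(prompts)), key=lambda i: avg_lengths[i])
    n_keep = max(1, int(len(prompts) * keep_fraction))
    return [prompts[i] for i in ranked[:n_keep]]
```

The selected prompts would then feed the usual policy-optimization step, and the selection is repeated every step as the model's typical response lengths change.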

Why it matters?

This research matters because it offers a more efficient and effective way to train large language models to reason. By addressing the problem of 'overthinking,' LSPO can produce models that give more accurate and concise answers, improving their usefulness in applications that depend on multi-step reasoning, such as question answering and mathematical problem solving. The ablation studies also offer insight into how best to use length information to guide the learning process.

Abstract

Since the release of Deepseek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.