Mitigating Overthinking through Reasoning Shaping
Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo, Wei Li, Linli Yao, Weimin Xiong, Liang Chen, Tianyu Liu, Houfeng Wang
2025-10-13
Summary
This paper investigates a problem with powerful AI reasoning models: they often get stuck in overly complex thought processes, using a lot of computing power without necessarily improving the answer. The researchers propose a new method to encourage these models to be more concise and efficient in their reasoning.
What's the problem?
Large AI models are getting better at solving problems, especially when trained with a technique called Reinforcement Learning from Verifier Reward (RLVR). However, these models tend to 'overthink': they generate very long, rambling explanations that consume a lot of time and compute. Previous attempts to shorten these explanations often hurt the model's ability to actually solve the problem correctly. The issue is that penalizing the model uniformly for every token it generates is too blunt: it cannot distinguish essential reasoning steps from filler, so a more fine-grained approach is needed.
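To make the critique concrete, here is a minimal sketch of the kind of uniform token-level penalty the paper argues is too coarse. The function name and the coefficient `beta` are illustrative, not from the paper; the point is that every token costs the same, whether it belongs to a crucial step or to filler.

```python
def token_level_penalty(response_tokens: list[str], beta: float = 0.001) -> float:
    """Naive per-token length penalty (the approach the paper critiques).

    Every token is charged the same cost, so the penalty cannot tell
    an essential reasoning step apart from redundant rambling.
    """
    return beta * len(response_tokens)
```

Subtracting such a penalty from the verifier reward shortens outputs indiscriminately, which is why it tends to degrade accuracy along with length.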
What's the solution?
The researchers developed a new technique called Group Relative Segment Penalization, or GRSP. Instead of looking at each token individually, GRSP operates on 'segments' of reasoning: complete ideas or steps in the problem-solving process. It penalizes the model based on the length of these segments, weighting the penalty by how each segment compares to the others. Their preliminary analysis showed that the length of reasoning segments correlates strongly with both token consumption and model performance, so they designed a length-aware weighting mechanism across clusters of segments.
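The idea can be sketched as follows. This is a simplified, assumed form of the method: the segmentation rule, the group-relative baseline (the mean segment length), and the length-proportional weight are all illustrative stand-ins, since the paper's exact formula is not given here.

```python
import re

def split_segments(reasoning: str) -> list[str]:
    # Hypothetical segmentation: treat each sentence or line as one step.
    return [s for s in re.split(r"(?<=[.!?])\s+|\n+", reasoning.strip()) if s]

def grsp_penalty(reasoning: str, alpha: float = 0.01) -> float:
    """Sketch of a group-relative, length-aware segment penalty.

    Each segment is compared against the group of segments it belongs
    to (here, simply all segments in the response): only segments longer
    than the group mean are penalized, and longer segments receive a
    larger weight. Many short, focused steps incur little penalty,
    while a few bloated steps incur a large one.
    """
    lengths = [len(s.split()) for s in split_segments(reasoning)]
    if not lengths:
        return 0.0
    mean_len = sum(lengths) / len(lengths)
    penalty = 0.0
    for n in lengths:
        weight = n / mean_len                   # length-aware weight (assumed form)
        penalty += weight * max(0.0, n - mean_len)  # group-relative excess
    return alpha * penalty
```

Under this sketch, a response made of uniformly short steps is penalized near zero, whereas the same token budget concentrated in one long segment is penalized heavily, which matches the summary's description of discouraging long segments rather than tokens per se.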
Why it matters?
This research is important because it helps make powerful AI models more practical. By making these models more efficient, we can reduce the cost of running them and make them accessible to more people. The GRSP method specifically improves performance on difficult problems, and it also makes the training process more stable, meaning the AI learns more reliably. This is a step towards building AI that can solve complex problems without wasting resources.
Abstract
Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often exhibit overthinking: excessive, meandering reasoning that inflates computational cost. Prior penalization designs in RLVR reduce token consumption but often harm model performance, a failure that arises from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with especially pronounced advantages on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.