
PEAR: Phase Entropy Aware Reward for Efficient Reasoning

Chen Huang, Wei Lu, Wenxuan Zhang

2025-10-14


Summary

This paper focuses on making large reasoning models, which are good at solving complex problems by explaining their thinking, more efficient. These models often produce overly long, wordy explanations, which cost more to run and are harder to use.

What's the problem?

Large reasoning models are great at explaining *how* they arrive at an answer, but these explanations are often too long and contain unnecessary steps. Simply shortening the responses often hurts the accuracy of the answer. The challenge is to find a way to make these explanations concise without sacrificing the model’s ability to solve the problem correctly.

What's the solution?

The researchers discovered that the model’s ‘uncertainty’ (measured as entropy) changes during the reasoning process. When the model is thinking through the problem, it’s more uncertain and generates longer responses. When it’s giving the final answer, it’s more certain and the response is shorter. They developed a new method called Phase Entropy Aware Reward (PEAR) that encourages the model to be less uncertain during the thinking phase, leading to shorter explanations, but still allows some uncertainty when finalizing the answer to ensure accuracy. PEAR essentially guides the model to be more focused and direct in its reasoning.
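The phase-dependent reward described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the coefficients (`think_penalty`, `answer_bonus`) and the additive reward shape are assumptions chosen to show the core idea that thinking-phase entropy is penalized while answer-phase entropy is tolerated.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pear_style_reward(correct, think_probs, answer_probs,
                      think_penalty=0.1, answer_bonus=0.05):
    """Illustrative PEAR-style reward: start from task correctness,
    penalize mean entropy over thinking-phase tokens (discouraging
    long exploratory traces), and allow moderate entropy in the
    final-answer phase. Coefficient values are hypothetical."""
    def mean_entropy(phase):
        return sum(token_entropy(p) for p in phase) / max(len(phase), 1)

    base = 1.0 if correct else 0.0
    return (base
            - think_penalty * mean_entropy(think_probs)
            + answer_bonus * mean_entropy(answer_probs))
```

Under this sketch, a correct response whose thinking-phase token distributions are sharply peaked (low entropy) earns a higher reward than an equally correct but more exploratory one, which is the pressure that shortens the reasoning trace.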

Why it matters?

This work is important because it provides a way to improve the practicality of large reasoning models. By making them more concise, we can reduce the computational cost of using them and make them easier for people to understand. The method also works well even when presented with problems it hasn’t seen before, making it a robust solution for real-world applications.

Abstract

Large Reasoning Models (LRMs) have achieved impressive performance on complex reasoning tasks by generating detailed chain-of-thought (CoT) explanations. However, these responses are often excessively long, containing redundant reasoning steps that inflate inference cost and reduce usability. Controlling the length of generated reasoning without sacrificing accuracy remains an open challenge. Through a systematic empirical analysis, we reveal a consistent positive correlation between model entropy and response length at different reasoning stages across diverse LRMs: the thinking phase exhibits higher entropy, reflecting the exploratory behavior behind longer responses, while the final answer phase shows lower entropy, indicating a more deterministic solution. This observation suggests that entropy at different reasoning stages can serve as a control knob for balancing conciseness and performance. Based on this insight, this paper introduces Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase and allows moderate exploration at the final answer phase, which encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly. This enables adaptive control of response length without relying on explicit length targets or rigid truncation rules. Extensive experiments across four benchmarks demonstrate that PEAR consistently reduces response length while sustaining competitive accuracy across model scales. In addition, PEAR demonstrates strong out-of-distribution (OOD) robustness beyond the training distribution. Our code is available at: https://github.com/iNLP-Lab/PEAR.
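The entropy measurement underlying the abstract's finding can be sketched as follows: compute each token's entropy from its logits, then average separately over the thinking and answer phases. The `</think>` delimiter used to split the phases is an assumption (reasoning models mark the boundary differently), and the softmax/entropy computation is generic, not the paper's exact pipeline.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_from_logits(logits):
    """Shannon entropy of the predictive distribution for one token."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)

def phase_entropies(token_logits, tokens, delimiter="</think>"):
    """Split a generated trace at the end-of-thinking delimiter
    (tag name is an assumption; models differ) and return the mean
    per-token entropy of the thinking and answer phases."""
    split = tokens.index(delimiter) if delimiter in tokens else len(tokens)
    think = [entropy_from_logits(l) for l in token_logits[:split]]
    answer = [entropy_from_logits(l) for l in token_logits[split + 1:]]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(think), mean(answer)
```

On a trace with flat (uncertain) distributions before the delimiter and peaked (confident) distributions after it, the thinking-phase mean exceeds the answer-phase mean, mirroring the correlation the abstract reports.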