
First Try Matters: Revisiting the Role of Reflection in Reasoning Models

Liwei Kang, Yue Deng, Yao Xiao, Zhanfeng Mo, Wee Sun Lee, Lidong Bing

2025-10-10

Summary

This paper investigates how useful the 'thinking steps' are when large language models solve math problems. These models keep getting better at reasoning, and the gains are often attributed to their ability to work through problems step by step, even revisiting answers they have already produced. This research tries to figure out whether those revisits actually improve the final result.

What's the problem?

Large language models seem to improve as they take more reasoning steps, but it wasn't clear *why*. Are they actually correcting mistakes during those extra steps, or just confirming their initial answer? The researchers wanted to understand whether the 'reflective' process – where a model continues thinking even after producing an answer – actually leads to more accurate solutions, or whether it is mostly wasted work.

What's the solution?

The researchers analyzed how eight different reasoning models solved problems from five math datasets, paying close attention to the reflective steps that come after a model has already produced an answer. They found that models rarely changed their initial answer during reflection; they mostly just confirmed it. They also fine-tuned models on data with varying amounts of these reflective steps and discovered that more reflection mainly helped a model get its *first* answer right, not fix wrong answers afterwards. Based on this, they developed a question-aware early-stopping method that halts reasoning once a few plausible candidate answers have been generated, and a way to dynamically truncate reflections once a candidate answer appears during generation, making inference noticeably more efficient.
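
To make the first part of that analysis concrete, here is a minimal sketch of how one could measure whether reflections confirm or change a model's first answer. The `\boxed{}` answer pattern and the function names are illustrative assumptions, not the authors' evaluation code.

```python
import re

# Assumes candidate answers appear as \boxed{...} in the chain of thought;
# real reasoning models may mark intermediate answers differently.
ANSWER_RE = re.compile(r"\\boxed\{([^}]*)\}")

def reflection_stats(rollouts, gold_answers):
    """Classify what reflection did to the first candidate answer.

    `rollouts` are full chains of thought; `gold_answers` are the reference
    answers. Only rollouts with at least two candidate answers (i.e. at
    least one reflection after the first answer) are counted.
    """
    confirmed = changed = wrong_to_right = 0
    for rollout, gold in zip(rollouts, gold_answers):
        candidates = ANSWER_RE.findall(rollout)  # in order of appearance
        if len(candidates) < 2:
            continue  # no post-answer reflection to analyze
        first, final = candidates[0], candidates[-1]
        if first == final:
            confirmed += 1           # reflection merely confirmed the answer
        else:
            changed += 1
            if first != gold and final == gold:
                wrong_to_right += 1  # reflection fixed a wrong first answer
    total = confirmed + changed
    return {
        "confirmatory_rate": confirmed / total if total else float("nan"),
        "changed_rate": changed / total if total else float("nan"),
        "wrong_to_right_rate": wrong_to_right / total if total else float("nan"),
    }
```

Under this kind of accounting, the paper reports that the confirmatory bucket dominates across all eight models and five datasets.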

Why it matters?

This research shows that the extra 'thinking' large language models do isn't always helpful. It suggests that we can make these models faster and more efficient by stopping them from endlessly reflecting on an answer once a reasonable solution is found. This is important because these models are expensive to run, and reducing the number of steps they take can save time and resources without significantly sacrificing accuracy.

Abstract

Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. However, the contribution of reflections to performance improvement remains unclear. In this paper, we systematically analyze the rollouts of eight reasoning models on five mathematical datasets. We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer, a pattern consistent across models and datasets. To understand the role of reflections in training, we construct supervised fine-tuning (SFT) datasets with varying amounts of reflection steps. We observe that training models on rollouts with more reflection steps primarily enhances first-answer correctness rather than the ability to correct initially wrong answers through reflections. This motivates us to propose a question-aware early-stopping method that enhances inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated, thereby reducing unnecessary reflection steps. Motivated by this, we further propose to dynamically truncate the reflections after a candidate answer has appeared during generation, which reduces reasoning tokens by 24.5% across five mathematical datasets, within a 2.9% drop in accuracy.
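
To illustrate the dynamic-truncation idea from the abstract, here is a minimal sketch assuming a hypothetical streaming interface `generate_stream` and a `\boxed{}` answer pattern; it is not the authors' implementation, and the 24.5% / 2.9% figures refer to their method, not this code.

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"\\boxed\{([^}]*)\}")  # assumed answer marker

def generate_with_truncation(generate_stream, prompt, min_repeats=2, max_tokens=4096):
    """Decode until some candidate answer has appeared `min_repeats` times,
    then cut off the remaining reflections.

    `generate_stream(prompt, max_tokens)` is a hypothetical interface that
    yields text chunks from the underlying model.
    """
    text = ""
    for chunk in generate_stream(prompt, max_tokens=max_tokens):
        text += chunk
        counts = Counter()
        # Rescan the accumulated text; counting matches in order of
        # appearance lets us truncate exactly at the confirming occurrence.
        for match in ANSWER_RE.finditer(text):
            counts[match.group(1)] += 1
            if counts[match.group(1)] >= min_repeats:
                return text[: match.end()], match.group(1)
    # Token limit reached without confirmation: fall back to the last candidate.
    candidates = ANSWER_RE.findall(text)
    return text, (candidates[-1] if candidates else None)
```

With `min_repeats=2`, generation stops the first time any candidate answer is seen twice, mirroring the paper's observation that a confirmed first answer rarely changes afterwards.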