Self-Reflective Generation at Test Time
Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu
2025-10-07
Summary
This paper introduces Self-Reflective Generation at Test Time (SRGen), a method that helps large language models (LLMs) reason more reliably when solving complex problems.
What's the problem?
Large language models are getting better at reasoning, but a small mistake early in their thought process can throw them off track. Because these models generate text one token at a time, an error at the beginning can snowball into a completely wrong answer. Existing methods to fix such errors either rewrite entire responses or require expensive extra training, making them slow and inefficient.
What's the solution?
SRGen tackles this by having the model pause and 'reflect' on its work *while* it is generating an answer, specifically at points where it is uncertain about what to produce next. It identifies these uncertain points by measuring how confident the model is in its next-token predictions. When uncertainty is high, SRGen adjusts the probabilities of the candidate next tokens, essentially correcting itself based on what it has already written. It requires no additional training: it works directly with the model at inference time.
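The detect-then-correct loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the entropy and softmax functions are standard, but the mean-plus-k-standard-deviations threshold and the single gradient-style correction step (nudging logits toward a `target_idx` inferred from context) are simplified stand-ins for SRGen's dynamic entropy thresholding and learned corrective vector.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of a next-token distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dynamic_threshold(recent_entropies, k=1.0):
    """Hypothetical dynamic rule: flag a token whose entropy exceeds the
    mean + k * std of entropies seen over a recent window."""
    n = len(recent_entropies)
    mean = sum(recent_entropies) / n
    var = sum((e - mean) ** 2 for e in recent_entropies) / n
    return mean + k * math.sqrt(var)

def corrective_step(logits, target_idx, lr=0.5):
    """One toy correction step: shift the logits toward the token the
    already-generated context favors (cross-entropy gradient direction).
    Stands in for optimizing SRGen's per-token corrective vector."""
    probs = softmax(logits)
    grad = [(1.0 if i == target_idx else 0.0) - p for i, p in enumerate(probs)]
    return [l + lr * g for l, g in zip(logits, grad)]

# Usage: near-uniform logits are high-entropy, so the correction fires.
logits = [1.0, 1.0, 1.0, 1.0]
history = [0.4, 0.5, 0.45]  # entropies at earlier, confident steps
if entropy(softmax(logits)) > dynamic_threshold(history):
    logits = corrective_step(logits, target_idx=0)
```

After the corrective step the distribution is sharper (lower entropy) and the context-preferred token is more likely, which is the intended effect of reflecting before committing to an uncertain token.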
Why it matters?
This research is important because it provides a way to make LLMs more trustworthy and accurate without requiring significant computational resources or large datasets. SRGen is easy to add to existing models and can be combined with other techniques to further improve performance, making it a practical step towards more reliable AI reasoning.
Abstract
Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile: early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both of which are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen uses dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a token-specific corrective vector that fully exploits the already generated context through self-reflective generation to correct the token's probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen consistently strengthens model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. In particular, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
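The abstract's Cons@5 metric refers to self-consistency voting: sample several reasoning chains, extract each chain's final answer, and take the majority. A minimal sketch of that voting step (the function name and the example answers are illustrative, not from the paper):

```python
from collections import Counter

def self_consistency_vote(final_answers):
    """Majority vote over final answers extracted from k sampled
    reasoning chains; Cons@k measures the accuracy of this vote."""
    counts = Counter(final_answers)
    return counts.most_common(1)[0][0]

# If 3 of 5 sampled chains end in "42", the vote returns "42".
winner = self_consistency_vote(["42", "7", "42", "42", "13"])
```

Because SRGen improves each individual chain (Pass@1), the sampled answers agree more often, which is why the voted accuracy (Cons@5) improves as well.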