ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson
2026-04-08
Summary
This paper introduces a new method called ThinkTwice that helps large language models (LLMs) get better at solving complex reasoning tasks, such as multi-step math problems. It focuses on teaching the model not just to *find* an answer, but also to *check* and improve its own work.
What's the problem?
Large language models perform well on many tasks, but they often struggle with problems that require multiple steps of logical reasoning. They make mistakes, and even when they reach the right answer, they can't always verify *why* it's right. Existing methods for improving these models often require extra supervision, such as annotations telling the model exactly which step went wrong, which are expensive and time-consuming to create.
What's the solution?
ThinkTwice works in two main steps. First, the model tries to solve the reasoning problem. Then, it immediately tries to improve its own solution to the *same* problem. Importantly, the model is only told whether its final answer is correct or not; it doesn't get specific feedback on its intermediate steps. Both phases are trained with a technique called Group Relative Policy Optimization (GRPO). The researchers tested ThinkTwice on several math benchmarks with different models and found it consistently outperformed other methods.
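The two-phase loop described above can be sketched in toy form. Everything here is illustrative and not the paper's implementation: `ToyModel`, the dict-shaped problem, and the `update` stub are hypothetical stand-ins, and the sketch omits GRPO's clipped token-level objective, showing only the solve-then-refine structure with a shared binary reward and group-relative advantages.

```python
import random

class ToyModel:
    """Stand-in policy: 'solves' by guessing and 'refines' by revisiting
    its own draft. A real implementation would sample from an LLM."""
    def solve(self, problem):
        return random.choice([problem["gold"], "wrong"])

    def refine(self, problem, draft):
        # Keep correct drafts; sometimes fix wrong ones.
        if draft == problem["gold"]:
            return draft
        return random.choice([problem["gold"], draft])

    def update(self, samples, advantages):
        pass  # a real GRPO step would apply a clipped policy-gradient update

def binary_reward(answer, gold):
    # The same verifiable correctness signal is used in both phases.
    return 1.0 if answer == gold else 0.0

def group_relative_advantages(rewards):
    # GRPO-style advantage: reward minus the group mean
    # (the full method also normalizes by the group's std).
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def train_step_pair(model, problem, group_size=4):
    """One ThinkTwice step pair: optimize solving, then optimize
    refining the model's own solutions to the same problem."""
    gold = problem["gold"]

    # Phase 1: sample a group of solutions, update on their advantages.
    solutions = [model.solve(problem) for _ in range(group_size)]
    model.update(solutions, group_relative_advantages(
        [binary_reward(s, gold) for s in solutions]))

    # Phase 2: refine those same solutions; the reward stays binary.
    refinements = [model.refine(problem, s) for s in solutions]
    model.update(refinements, group_relative_advantages(
        [binary_reward(r, gold) for r in refinements]))
    return solutions, refinements
```

Note that no step-level feedback ever enters the loop: phase 2 sees only the draft and, through the advantage, whether its rewrite ended up correct.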
Why does it matter?
This research is important because it shows a way to significantly improve the reasoning abilities of LLMs without needing a lot of human-provided feedback. The method encourages the model to learn from its own mistakes and build on its successes, leading to more reliable and accurate results. It suggests a more efficient and scalable way to train these powerful AI systems for complex tasks.
Abstract
We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without step-level correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families, Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
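The abstract reports results as pass@4: the chance that at least one of four sampled solutions is correct. Assuming the authors use the standard unbiased estimator (from Chen et al.'s Codex paper, which is the common convention; the paper's exact protocol is not stated here), it can be computed from n generations of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that a random size-k subset of
    n generations (c of them correct) contains a correct one."""
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 4 samples of which c = 1 is correct, pass@4 is 1.0 while pass@1 is 0.25.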