Self-rewarding correction for mathematical reasoning
Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
2025-02-28
Summary
This paper is about teaching AI language models to check and correct their own work, especially when solving math problems. The researchers developed a method that lets a single model both solve problems and evaluate its solutions without help from other systems.
What's the problem?
Usually, when AI models solve complex problems, they need a separate system to check their work and tell them whether they're right or wrong. This is inefficient and complicates deploying these models in real-world applications.
What's the solution?
The researchers created a two-step process to teach AI models to check and correct themselves. First, they had the model generate many example solutions, including ones where it catches and fixes its own mistakes, and then fine-tuned the model on those examples. In the second step, they used a technique called reinforcement learning to make the model even better at judging its own work and fixing errors.
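The first step can be sketched as rejection sampling: keep only trajectories that end in a correct answer, and label each attempt with a self-evaluation tag that matches the ground truth. The `solve` and `check` functions below are hypothetical stand-ins (a mock sampler and answer checker, not the paper's actual components), so this is only a minimal illustration of the data-synthesis idea:

```python
import random

def solve(problem, rng):
    """Mock solver standing in for an LLM sampler: sometimes wrong."""
    return problem["answer"] if rng.random() < 0.6 else problem["answer"] + 1

def check(problem, answer):
    """Ground-truth checker, used only while synthesizing training data."""
    return answer == problem["answer"]

def synthesize_trajectory(problem, rng, max_rounds=3):
    """Build one self-correction trajectory by sequential rejection sampling.

    Each attempt is recorded with a self-evaluation tag that agrees with
    the ground-truth verdict; the loop stops at the first correct answer.
    Trajectories that never reach a correct answer are rejected. After
    fine-tuning on such data, the model emits these tags on its own.
    """
    steps = []
    for _ in range(max_rounds):
        answer = solve(problem, rng)
        correct = check(problem, answer)
        steps.append({
            "answer": answer,
            "self_eval": "[VERIFY] correct" if correct else "[VERIFY] wrong",
        })
        if correct:   # terminate the iterative refinement loop
            return steps
    return None       # reject this trajectory entirely

rng = random.Random(0)
problems = [{"answer": 7}, {"answer": 3}]
dataset = [t for p in problems
           if (t := synthesize_trajectory(p, rng)) is not None]
```

Every accepted trajectory ends with a step tagged correct, and any earlier steps are tagged wrong, which is exactly the self-rewarding-then-revising pattern the model is meant to imitate.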
Why it matters?
This matters because it could make AI systems more independent and reliable. If AI can check and correct its own work, especially in areas like math where accuracy is crucial, it could be used more confidently in education, research, and many other fields. This self-correcting ability could also make AI systems more efficient and easier to use in various applications, potentially leading to new advancements in how we use AI to solve complex problems.
Abstract
We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on this curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.
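A rule-based RL signal of the kind the second stage describes can be written without any learned reward model: reward a correct final answer and reward the model for judging its own attempt accurately. The exact weights below are illustrative assumptions, not the paper's actual values:

```python
def rule_based_reward(final_answer_correct: bool,
                      predicted_verdict: str,
                      true_verdict: str) -> float:
    """Illustrative rule-based reward for self-rewarding RL.

    Two additive terms:
      - was the final answer right?  (+1.0 / -1.0)
      - did the model's own verdict ("correct"/"wrong") match the
        ground-truth verdict?        (+0.5 / -0.5)
    The constants are assumptions for illustration only.
    """
    reward = 1.0 if final_answer_correct else -1.0
    reward += 0.5 if predicted_verdict == true_verdict else -0.5
    return reward
```

Because the signal is a fixed rule over checkable outcomes, it needs no separate reward model at training time, which is the point of the "rule-based signals" phrasing above.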