AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin
2025-02-20
Summary
This paper introduces AdaptiveStep, a new way to improve how AI models learn to solve problems step-by-step. It's like teaching a computer to show its work in math class, but in a smarter way that focuses on the important parts of the problem-solving process.
What's the problem?
Current methods for training AI to solve problems step-by-step use fixed rules to break down the process, like splitting on specific words or cutting the reasoning into a set number of steps. This isn't very flexible and doesn't always capture the truly important parts of solving a problem. It's like forcing someone to explain their thinking in exactly five sentences, even if some problems need more or less explanation.
What's the solution?
The researchers created AdaptiveStep, which looks at how confident the AI is as it predicts each next word in its solution. When the AI's confidence drops, AdaptiveStep marks a new step in the reasoning process. This way, the steps are divided at points where the AI is making important decisions, not just at arbitrary points. They tested this method on math problems and coding tasks, and found that it outperformed the older rule-based methods.
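The idea of splitting reasoning at low-confidence tokens can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the threshold value, the token list, and the confidence scores below are all made up for the example; in practice the confidences would be the language model's predicted probabilities for each generated token.

```python
def split_into_steps(tokens, confidences, threshold=0.8):
    """Start a new reasoning step before each token whose predicted
    probability (the model's confidence) falls below the threshold."""
    steps, current = [], []
    for token, conf in zip(tokens, confidences):
        if current and conf < threshold:
            steps.append(current)  # low confidence -> decision point -> new step
            current = []
        current.append(token)
    if current:
        steps.append(current)
    return steps

# Toy example (hypothetical numbers): confidence dips at "so" and
# "therefore", so those tokens open new reasoning steps.
tokens = ["2", "+", "3", "=", "5", ",", "so", "x", "=", "5", ".",
          "therefore", "answer", "is", "5"]
confs  = [0.99, 0.97, 0.98, 0.95, 0.96, 0.90, 0.40, 0.92, 0.93, 0.97, 0.90,
          0.30, 0.91, 0.95, 0.99]
steps = split_into_steps(tokens, confs)
# -> three steps, each beginning at a point where the model hesitated
```

Note how the boundaries land where the model is genuinely uncertain about what comes next, rather than at a fixed word like a newline, which is the contrast the paper draws with rule-based splitting.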
Why it matters?
This matters because it could make AI better at explaining how it solves complex problems, which is crucial for tasks like math, science, and coding. By improving how AI learns to reason, we could create more trustworthy and understandable AI systems. This could lead to better AI tutors, more reliable automated problem-solving tools, and AI that can explain its decisions in a way that makes sense to humans. It's a step towards AI that doesn't just give answers, but can show its work in a meaningful way.
Abstract
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or setting the reasoning step's length to a fixed size. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division method provides more decision-making information at each step, enhancing downstream tasks such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing the greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study of the PRM's performance, transferability, and generalization capabilities.