DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning
Yuanhao Wu, Juntong Song, Hanning Zhang, Tong Zhang, Cheng Niu
2025-06-27
Summary
This paper introduces DuaShepherd, a reward modeling framework that helps large language models get better at solving math problems by rewarding them not just for the final answer but also for taking correct steps and staying on paths that are likely to lead to a correct answer.
What's the problem?
AI models are often rewarded only for whether their final math answer is right or wrong, which gives them little signal about which steps in their reasoning were sound and makes it hard to improve that reasoning step by step.
What's the solution?
The researchers created a framework in which the AI receives two types of reward: one for the correctness of each step in the problem-solving process, and another for the potential of the current partial solution to reach a correct final answer. They combined these signals in a single multi-head reward model that learns both at the same time, improving the AI's ability to reason through complex math problems.
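As a rough illustration of the two-reward idea described above, the sketch below scores candidate solutions from per-step correctness and potential values. It is not the authors' implementation: the function names, the choice to multiply the two signals, the use of the minimum over steps as the solution score, and all numbers are assumptions made for this example.

```python
# Hypothetical sketch: rank candidate solutions using two per-step signals.
# Assumptions (not taken from the paper summary): signals are probabilities
# in [0, 1], they are combined by multiplication, and a solution is only as
# good as its weakest step (minimum over steps).

def combine_step_scores(correctness, potential):
    """Return one combined score per step from the two reward signals."""
    if len(correctness) != len(potential):
        raise ValueError("both signals need one value per step")
    return [c * p for c, p in zip(correctness, potential)]


def score_solution(correctness, potential):
    """Aggregate the combined step scores into a single solution score."""
    return min(combine_step_scores(correctness, potential))


def pick_best(candidates):
    """Return the name of the candidate with the highest solution score.

    `candidates` maps a name to a (correctness, potential) pair of lists.
    """
    return max(candidates, key=lambda name: score_solution(*candidates[name]))


# Made-up scores for two candidate solutions with two steps each.
candidates = {
    "A": ([0.9, 0.8], [0.9, 0.5]),  # strong first step, weak potential later
    "B": ([0.7, 0.7], [0.8, 0.9]),  # moderate but consistent on both signals
}
best = pick_best(candidates)
```

In this toy example candidate B is selected, because candidate A's second step has low potential even though it looks correct. A step that is correct but unlikely to lead anywhere is exactly the case a second signal is meant to catch.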
Why does it matter?
This matters because better step-by-step mathematical reasoning in AI can make it more reliable and useful for solving difficult problems, which is helpful for education, science, and technology.
Abstract
DuaShepherd, a novel reward modeling framework, integrates stepwise correctness and potential signals into a unified multi-head architecture to enhance LLMs' mathematical reasoning capabilities and achieve state-of-the-art performance.