A novel reward modeling framework DuaShepherd integrates correctness and potential signals into a unified multi-head architecture to enhance LLMs' mathematical reasoning capabilities and achieve state-of-the-art performance.

This paper talks about DuaShepherd, a new system that helps large AI models get better at solving math problems by rewarding them not just for the final answer but also for making correct steps and following good paths along the way.

DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning

Summary

What's the problem?

What's the solution?

Why it matters?

Abstract