The Lessons of Developing Process Reward Models in Mathematical Reasoning

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

2025-01-14

Summary

This paper is about making AI better at solving math problems by teaching it to check each step of its reasoning, not just the final answer. The researchers found problems with how these step-checking models are usually trained and tested, and came up with new ways to make them work better.

What's the problem?

When AI tries to solve math problems, it sometimes makes mistakes along the way, and it can even land on the right answer through faulty steps. People have tried to teach AI to check its work step by step, but this is hard to do well. The usual way of creating the training labels, having the AI sample lots of possible continuations and scoring a step by how often those continuations end in the right answer, often mislabels steps. And the usual test, picking the best of several candidate answers, can give a misleading picture, because it rewards right answers even when the reasoning behind them is flawed, so the AI drifts toward judging only the final answer instead of the process.
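To make that difference concrete, here is a tiny illustrative sketch (in Python, with made-up function names that are not from the paper) of outcome-only checking versus step-by-step checking:

```python
def outcome_check(final_answer, correct_answer):
    """Outcome-based scoring: only the final answer matters."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_check(steps, step_scorer):
    """Process-based scoring: every intermediate step gets its own score,
    so a lucky right answer reached through a flawed derivation is still flagged."""
    return [step_scorer(step) for step in steps]

# A solution that reaches the right answer via a bad middle step:
steps = ["Let x = 7", "Drop the minus sign (not allowed)", "So 6 * x = 42"]
print(outcome_check(final_answer=42, correct_answer=42))                    # 1.0 -- looks fine
print(process_check(steps, lambda s: 0.0 if "not allowed" in s else 1.0))   # flags step 2
```

The outcome check gives this solution full credit; the process check exposes the faulty step, which is exactly what a Process Reward Model is supposed to do.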

What's the solution?

The researchers ran a lot of experiments to figure out what was going wrong. They found that labeling each step with an LLM acting as a judge, or with human annotators, works better than the common shortcut of sampling many continuations and scoring steps by how often they lead to the right answer (Monte Carlo estimation). They then built a consensus filter that keeps a training example only when the Monte Carlo labels and the LLM judge agree about where the first mistake is. To measure progress, they argue for looking at both the response level (did the model pick a good answer?) and the step level (did it catch the first wrong step?). Putting these pieces together, they trained a new model that is state-of-the-art at checking mathematical reasoning step by step.
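Below is a minimal sketch of the consensus-filtering idea, with assumed function names and data layout rather than the authors' actual pipeline: an example survives only when both annotation sources point to the same first wrong step.

```python
def first_error_step(step_labels):
    """Return the index of the first step labeled incorrect, or None if all steps pass."""
    for i, ok in enumerate(step_labels):
        if not ok:
            return i
    return None

def consensus_filter(examples):
    """Keep only examples where the two annotation sources agree on the first error."""
    kept = []
    for ex in examples:
        mc_err = first_error_step(ex["mc_labels"])        # labels from Monte Carlo rollouts
        judge_err = first_error_step(ex["judge_labels"])  # labels from an LLM-as-a-judge
        if mc_err == judge_err:
            kept.append(ex)
    return kept

examples = [
    {"mc_labels": [True, False, False], "judge_labels": [True, False, True]},  # both say step 1: kept
    {"mc_labels": [True, True, False],  "judge_labels": [True, False, True]},  # disagree: dropped
]
print(len(consensus_filter(examples)))  # 1
```

The point of the filter is data quality: noisy Monte Carlo labels are only trusted when an independent judge reaches the same conclusion.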

Why it matters?

This research matters because it helps AI get better at solving complex problems, especially in math. By teaching AI to check its work step-by-step, it can become more reliable and trustworthy. This could be really useful in fields like science, engineering, or finance where getting the right answer and using the right method are both important. The new ways of teaching and testing AI that the researchers came up with could also help make AI better at other kinds of tasks where the process is just as important as the final result.

Abstract

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
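For a concrete picture of the Best-of-N setup the abstract critiques, here is a small, assumed sketch of BoN selection with a PRM, using the common convention of scoring a response by its weakest step; the step scores are mocked numbers, not outputs of a trained model.

```python
from typing import List

def aggregate_prm_score(step_scores: List[float]) -> float:
    """Score a whole response by its weakest step (min aggregation)."""
    return min(step_scores)

def best_of_n(candidates: List[dict]) -> dict:
    """Pick the candidate whose aggregated PRM score is highest."""
    return max(candidates, key=lambda c: aggregate_prm_score(c["step_scores"]))

candidates = [
    {"answer": "42", "step_scores": [0.90, 0.20, 0.95]},  # right answer, weak middle step
    {"answer": "42", "step_scores": [0.80, 0.85, 0.90]},  # right answer, sound process
]
print(best_of_n(candidates)["step_scores"])  # the consistently sound response wins
```

If the minimum score routinely lands on the final answer step rather than on flawed intermediate steps, the PRM is effectively grading outcomes instead of the process, which is the drift the abstract describes in point (3).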