
ProcessBench: Identifying Process Errors in Mathematical Reasoning

Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

2024-12-10

Summary

This paper introduces ProcessBench, a new benchmark designed to identify mistakes in the step-by-step reasoning of language models when they solve math problems.

What's the problem?

Language models, which are AI systems that can understand and generate text, often make errors when solving math problems. These errors can be hard to catch because existing methods mainly check whether the final answer is correct, without examining how the model got there. As a result, even when the answer is right, the reasoning behind it may be flawed, which makes it difficult to trust the model's solutions.

What's the solution?

To address this issue, the authors created ProcessBench, which includes 3,400 test cases built around competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with the location of the earliest error annotated by human experts. Given a solution, a model must identify the first step that contains a mistake or confirm that all steps are correct (a minimal sketch of this setup follows below). The researchers evaluated two types of models: process reward models (PRMs) and critic models, which are general language models prompted to critique each solution step by step. They found that existing PRMs struggled to generalize to harder problems beyond datasets like GSM8K and MATH, while the best open-source model, QwQ-32B-Preview, critiqued solutions nearly as well as the proprietary GPT-4o.
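
To make the task concrete, here is a minimal sketch of what a ProcessBench-style test case and its judging rule might look like. The field names, the 0-indexed step numbering, and the toy example are illustrative assumptions, not the benchmark's actual data schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProcessBenchCase:
    """Hypothetical shape of one test case: a problem, its step-by-step
    solution, and the index of the earliest erroneous step
    (None if every step is correct), as annotated by human experts."""
    problem: str
    steps: List[str]
    first_error_step: Optional[int]

def judge_prediction(case: ProcessBenchCase, predicted: Optional[int]) -> bool:
    """A prediction counts as correct if it points at the earliest erroneous
    step, or declares the solution error-free (None) when it really is."""
    return predicted == case.first_error_step

# Toy example: the second step (index 1) contains the first mistake.
case = ProcessBenchCase(
    problem="Compute 3 * (4 + 5).",
    steps=["4 + 5 = 9", "3 * 9 = 28", "So the answer is 28."],
    first_error_step=1,
)
print(judge_prediction(case, predicted=1))  # True
```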

Why it matters?

This research is important because it provides a way to improve how we evaluate AI systems when they tackle complex tasks like math. By focusing on the reasoning process rather than just the final answer, ProcessBench helps ensure that language models are more reliable and accurate in their problem-solving abilities. This can lead to better applications of AI in education and other fields where precise reasoning is essential.

Abstract

As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated the critique capability competitive with the proprietary model GPT-4o, despite that it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
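
As a rough illustration of the critic-model setup described above, the sketch below assembles a prompt that asks a general language model to review a solution step by step and report the earliest incorrect step. The prompt wording and the index convention are assumptions made for illustration; they are not the paper's actual prompt.

```python
def build_critic_prompt(problem: str, steps: list) -> str:
    """Build a step-by-step critique prompt for a general language model.
    Wording and the reply-with-an-index convention are illustrative only."""
    numbered = "\n".join(f"Step {i}: {step}" for i, step in enumerate(steps))
    return (
        "You are checking a math solution for mistakes.\n"
        f"Problem: {problem}\n"
        f"Solution:\n{numbered}\n\n"
        "Go through the steps in order. Reply with the index of the earliest "
        "incorrect step, or -1 if every step is correct."
    )

# Example usage with a toy two-step solution.
print(build_critic_prompt("Compute 3 * (4 + 5).", ["4 + 5 = 9", "3 * 9 = 28"]))
```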