Process Reward Models That Think
Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
2025-04-25
Summary
This paper introduces ThinkPRM, a new kind of AI model that checks and judges long step-by-step reasoning, verifying that each step makes sense even when trained with very little direct human supervision.
What's the problem?
Most reward models, which are supposed to help AI judge the quality of answers or solutions, struggle with complex, multi-step reasoning. They often score only the final answer instead of checking whether each step of the reasoning process is logical and correct.
What's the solution?
The researchers created ThinkPRM, a careful verifier that works through a chain of thought step by step, generating its own reasoning about whether each step holds up. It is trained with minimal supervision, meaning it needs only a small number of labeled examples, yet it still outperforms older discriminative models and even LLM-as-a-Judge baselines on a wide range of tests.
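To make the idea concrete, here is a minimal sketch of how a generative verifier can drive best-of-N answer selection: each candidate solution is passed to a verifier that emits a step-by-step critique, per-step verdicts are parsed out, and the candidate with the highest fraction of verified steps wins. The `toy_verify` function below is a hypothetical stand-in for a real ThinkPRM-style model call, and the critique format is an assumption for illustration only.

```python
# Hedged sketch: verifier-guided best-of-N selection.
# `toy_verify` stands in for a generative PRM (hypothetical; a real
# system would call a model like ThinkPRM here).

def parse_verdicts(critique: str) -> list[bool]:
    """Extract per-step correct/incorrect verdicts from a critique text."""
    verdicts = []
    for line in critique.splitlines():
        line = line.strip().lower()
        if line.startswith("step"):
            # A step counts as verified only if judged "correct",
            # not "incorrect" (which contains "correct" as a substring).
            verdicts.append("correct" in line and "incorrect" not in line)
    return verdicts

def score_solution(critique: str) -> float:
    """Score a solution as the fraction of steps judged correct."""
    verdicts = parse_verdicts(critique)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

def best_of_n(solutions, verify):
    """Return the candidate whose verification critique scores highest."""
    return max(solutions, key=lambda s: score_solution(verify(s)))

# Toy verifier standing in for an LLM call (assumed output format):
def toy_verify(solution: str) -> str:
    if "2 + 2 = 4" in solution:
        return "step 1: correct\nstep 2: correct"
    return "step 1: correct\nstep 2: incorrect"

candidates = ["2 + 2 = 5, so the answer is 5",
              "2 + 2 = 4, so the answer is 4"]
print(best_of_n(candidates, toy_verify))  # → 2 + 2 = 4, so the answer is 4
```

The key design point this sketch illustrates is that the verifier's judgment is itself generated reasoning (a critique), rather than a single scalar from a discriminative head, which is what lets it scale with a larger verification token budget.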
Why does it matter?
This matters because it helps AI become more trustworthy and reliable, especially for tasks that require careful thinking and explanation, like solving math problems, giving advice, or making decisions that affect people.
Abstract
ThinkPRM, a long chain-of-thought verifier, uses minimal supervision to outperform discriminative PRMs and LLM-as-a-Judge across various benchmarks and token budgets.