
PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference

Rituraj Sharma, Weiyuan Chen, Noah Provenzano, Tu Vu

2026-03-04

Summary

This paper focuses on improving the reasoning abilities of AI systems, specifically those that work by exploring many different potential solutions to a problem at once, a method called DEEPTHINK. It introduces a new technique, PRISM, to make these systems more accurate and efficient.

What's the problem?

DEEPTHINK systems are good at tackling hard problems in math and science, but they often struggle because they have no reliable way to check whether the solutions they're developing are actually correct *while* they're working on them. This means that if a system starts down the wrong path, it can keep making mistakes and amplify those errors, drowning out the few correct ideas that are present. Essentially, more thinking doesn't always lead to better answers, and can sometimes make things worse.

What's the solution?

The researchers propose PRISM, which stands for Process Reward Model-guided inference. Think of it like giving the AI system little 'checkpoints' during its reasoning process. PRISM breaks down the problem-solving into steps and verifies each step, using this feedback to guide the system towards better solutions. It treats potential solutions as particles moving in an 'energy landscape' where better solutions are more stable, and it reshapes the pool of solutions to focus on the most promising ones while still keeping some variety. This helps avoid getting stuck on incorrect paths and allows the system to refine its thinking effectively.
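
The resample-and-refine loop described above can be sketched as a simple particle method. This is a minimal illustrative sketch, not the paper's implementation: the `prm_score` function, the toy numeric candidates, and all parameter values are assumptions standing in for a learned Process Reward Model and LLM-generated reasoning steps.

```python
import math
import random

def prm_score(candidate):
    # Hypothetical stand-in for a Process Reward Model: score a candidate
    # (a list of step values) by how close each step is to a target of 1.0.
    # A real PRM would verify each reasoning step with a learned model.
    return -sum((s - 1.0) ** 2 for s in candidate) / len(candidate)

def resample(population, scores, temperature=0.5):
    # Score-guided resampling: draw a new population (with replacement)
    # with probability proportional to exp(score / temperature), which
    # concentrates probability mass on higher-scoring candidates.
    weights = [math.exp(s / temperature) for s in scores]
    return random.choices(population, weights=weights, k=len(population))

def refine(candidate, noise=0.1):
    # Stochastic refinement: perturb each step slightly, standing in for
    # an LLM rewriting a reasoning step; the noise preserves diversity.
    return [s + random.gauss(0, noise) for s in candidate]

def prism_like_loop(population, rounds=10):
    # Alternate scoring, score-guided resampling, and stochastic
    # refinement, reshaping the population toward better solutions.
    for _ in range(rounds):
        scores = [prm_score(c) for c in population]
        population = resample(population, scores)
        population = [refine(c) for c in population]
    return population

random.seed(0)
pop = [[random.uniform(-2, 2) for _ in range(4)] for _ in range(16)]
before = sum(prm_score(c) for c in pop) / len(pop)
pop = prism_like_loop(pop)
after = sum(prm_score(c) for c in pop) / len(pop)
print(after > before)  # the population drifts toward higher PRM scores
```

The key design choice this illustrates is the temperature in the resampling step: a low temperature focuses the population on the best-scoring candidates, while a higher one keeps more variety, which is the diversity-versus-quality trade-off the summary describes.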

Why does it matter?

This work is important because it addresses a key limitation of DEEPTHINK methods. By providing a way to reliably evaluate progress during reasoning, PRISM makes these systems more trustworthy and capable of solving complex problems. The results show PRISM performs as well as, or even better than, existing methods, and it does so efficiently, meaning it gets better results without necessarily needing a huge amount of computing power. This could lead to significant advancements in AI's ability to tackle challenging tasks in fields like mathematics and science.

Abstract

DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference, which creates a population-enhancement bottleneck where deeper deliberation amplifies errors, suppresses correct minority solutions, and yields weak returns to additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, which concentrates probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.