
MASPRM: Multi-Agent System Process Reward Model

Milad Yazdani, Mahdi Mostajabdaveh, Zirui Zhou, Ying Xiong

2025-10-30


Summary

This paper introduces a new method called MASPRM, which helps teams of artificial intelligence agents work together more effectively and efficiently, especially on complex reasoning tasks such as math problems.

What's the problem?

When you have multiple AI agents trying to solve a problem together, it can be slow and inefficient to have them all explore every possible solution path. It's like a group project where everyone is brainstorming every idea, even the bad ones, instead of focusing on the most promising approaches. Existing methods struggle to decide *where* to focus the AI's processing power to get the best results quickly.

What's the solution?

MASPRM acts like a coach for these AI teams. It learns to predict how good each step taken by each agent is, based on what has happened so far. This allows it to steer the agents toward the most promising solution paths and to quickly discard those that are unlikely to succeed. It is trained by letting the agents practice solving problems and then learning from their successes and failures, without needing humans to label every single step. During actual problem-solving, MASPRM helps the agents search more strategically, either by keeping only the few best options at each step (beam search) or by simulating ahead to see which branches pay off (tree search).
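To make the search idea concrete, here is a minimal, self-contained sketch of step-level beam search guided by a process reward model. This is not the authors' code: `propose_steps` and `prm_score` are hypothetical stand-ins, where the first would let the current agent draft candidate next actions and the second would be a trained value model like MASPRM.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    transcript: tuple = ()   # partial inter-agent transcript so far
    score: float = 0.0       # cumulative process-reward value

def propose_steps(transcript, k):
    # Placeholder: in a real MAS, the current agent would draft k candidate
    # next actions (messages, tool calls, partial solutions) here.
    return [f"candidate-{i}" for i in range(k)]

def prm_score(transcript, action):
    # Placeholder: a trained process reward model would return its estimate
    # of how promising this step is, given the transcript so far.
    return float(-len(action))  # dummy value for the sketch

def step_level_beam_search(beam_width=4, expand_k=3, max_steps=5):
    beams = [Candidate()]
    for _ in range(max_steps):
        pool = []
        for cand in beams:
            for action in propose_steps(cand.transcript, expand_k):
                pool.append(Candidate(
                    transcript=cand.transcript + (action,),
                    score=cand.score + prm_score(cand.transcript, action),
                ))
        # Keep only the highest-value branches; everything else is pruned
        # early instead of consuming more compute.
        beams = sorted(pool, key=lambda c: c.score, reverse=True)[:beam_width]
    return beams[0]

best = step_level_beam_search()
print(best.transcript, best.score)
```

The key design point the sketch illustrates is that scoring happens per step, not only on the final answer, so weak branches stop consuming compute as soon as the value model flags them.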

Why it matters?

This research is important because it makes multi-agent AI systems much more practical. By improving their speed and accuracy, especially on challenging tasks like complex math problems, it opens the door to using these systems in real-world applications where quick and reliable answers are crucial. The fact that a model trained on one benchmark (GSM8K) still works on a harder one (MATH) without being retrained is a big step forward, showing its adaptability and potential.

Abstract

Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts without requiring step-level human annotations, by propagating returns to local targets. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning early. On GSM8K and MATH, MASPRM-guided decoding with an outcome reward model (ORM) applied to the final answer improves exact match (EM) over a single straight-through MAS pass by +30.7 and +22.9 points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding 8.4 EM points at the same budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM
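The training recipe the abstract describes, propagating rollout returns back to per-step targets rather than using human step labels, can be illustrated with a toy function. Everything here (the function name, the single-rollout setup, the optional discounting) is an assumption for illustration, not the paper's implementation.

```python
# Illustrative sketch (assumed, not the paper's code) of turning the outcome
# of one finished rollout into per-step value targets for PRM training.
def returns_to_step_targets(steps, final_reward, gamma=1.0):
    """Walk the transcript backwards, assigning each prefix a (possibly
    discounted) copy of the rollout's final return as its training target."""
    targets, ret = [], final_reward
    for t in range(len(steps), 0, -1):
        targets.append((steps[:t], ret))  # (partial transcript, value target)
        ret *= gamma                      # discount toward earlier steps
    return list(reversed(targets))

# A rollout whose final answer was judged correct (return 1.0) yields one
# supervised (prefix, value) pair per agent action -- no human step labels.
for prefix, target in returns_to_step_targets(["plan", "solve", "check"], 1.0):
    print(len(prefix), "steps ->", target)
```

In the actual method these targets come from many MCTS rollouts per problem, so each partial transcript's target reflects aggregated search returns rather than a single outcome as in this toy version.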