Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu

2025-11-25

Summary

This paper focuses on improving how teams of AI agents work together, specifically when each agent has a different job to do.

What's the problem?

Multi-agent systems handle general tasks well but struggle in specialized areas, because they are usually trained with a single 'brain' shared by all agents. This is like making every student in a school learn exactly the same material: it ignores their different roles and needs. Giving each agent its own 'brain' and training it separately sounds like the obvious fix, but it is technically difficult. Agents work at different speeds, some are invoked far more often than others within a task, and they may even run on different computers, which makes it hard to update everyone's learning in a coordinated way.

What's the solution?

The researchers developed a new training method called M-GRPO. Think of it like a coach who understands how each player on a team contributes to the overall success. M-GRPO figures out how much each agent, both the 'planner' and the 'tool users', deserves credit for the team's performance. It also cleverly organizes the training data, even when some agents are used more often than others, ensuring everyone gets a fair chance to learn. Importantly, it allows agents to be trained on separate computers without needing constant back-and-forth communication during training, making the process much faster and more efficient.
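The 'fair credit' idea above follows the GRPO recipe: each agent's reward is compared to the average of a group of attempts at the same task, so an agent is rewarded for doing better than its own peers rather than against an absolute score. A minimal sketch (the function name and example rewards are illustrative, not from the paper):

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each rollout's reward
    against the mean and std of its own group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids divide-by-zero

# Main agent (planner): one group of rollouts for the same task.
main_adv = group_relative_advantage([0.0, 1.0, 1.0, 0.0])

# A sub-agent (tool executor) gets its own group and its own rewards,
# so credit is assigned separately at each level of the hierarchy.
sub_adv = group_relative_advantage([0.2, 0.9, 0.5])
```

Because each agent is scored only against its own group, the planner and the tool executors can be optimized on separate servers without sharing gradients, exchanging only these lightweight statistics.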

Why it matters?

This work is important because it allows for the creation of more capable and efficient multi-agent systems. By letting each agent specialize and optimizing their training independently, these systems can perform complex tasks, like web searching or answering questions, much better than before. This is a step towards building AI teams that can tackle real-world problems that are too difficult for a single AI to handle.

Abstract

Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit performance, since different agents face different underlying data distributions. Training multi-agent systems with distinct LLMs is therefore a natural next step. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical Multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning.
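The trajectory-alignment scheme mentioned in the abstract must turn a variable number of sub-agent invocations per rollout into fixed-size training batches. The paper's exact mechanism is not spelled out here, so the sketch below is a hypothetical illustration of the idea: subsample when a rollout produced too many sub-agent trajectories, and pad by resampling with replacement when it produced too few.

```python
import random

def align_trajectories(sub_trajs, batch_size, rng=None):
    """Force a variable-length list of sub-agent trajectories into a
    fixed-size batch (illustrative; not the paper's exact scheme)."""
    rng = rng or random.Random(0)
    if not sub_trajs:
        return []  # this rollout never invoked the sub-agent
    if len(sub_trajs) >= batch_size:
        return rng.sample(sub_trajs, batch_size)  # subsample without replacement
    # pad by resampling existing trajectories with replacement
    return sub_trajs + rng.choices(sub_trajs, k=batch_size - len(sub_trajs))

# Three rollouts with 1, 3, and 0 sub-agent invocations respectively.
rollouts = [["t1"], ["t1", "t2", "t3"], []]
batches = [align_trajectories(t, batch_size=2) for t in rollouts]
```

With fixed-size batches, the sub-agent's optimizer sees a uniform tensor shape at every step, which is what allows it to train on its own server independently of how often the planner happened to call it.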