
Multi-Agent Tool-Integrated Policy Optimization

Zhanfeng Mo, Xingxuan Li, Yuntao Chen, Lidong Bing

2025-10-09


Summary

This paper introduces a new method called Multi-Agent Tool-Integrated Policy Optimization, or MATPO, which improves how large language models handle complex tasks that require using external tools and reasoning over a lot of information.

What's the problem?

Large language models are getting better at tackling complicated problems, especially when they can use tools like search engines or calculators. However, current systems often struggle because they can only process a limited amount of information at once, and the tools they use don't always give clean, reliable answers. Splitting the work across multiple 'agents' (separate roles, each with a specific job) seems like a natural solution, but until now there has been no effective way to further train these tool-using multi-agent systems with reinforcement learning after their initial training.

What's the solution?

The researchers developed MATPO, a way to train a single large language model to act as *both* a planner and a worker. The planner figures out what steps to take, and the worker actually uses external tools to carry out those steps. Both roles are improved with reinforcement learning, a type of training where the model learns by trial and error, and the reward earned by the final answer is credited back to the planner's and the workers' contributions so each role learns from the outcome of the whole task. Carefully designed role-specific prompts tell the model which role it is playing at any given moment, so everything runs within a single model instance and there is no need to deploy multiple models.
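To make the setup concrete, here is a minimal sketch of the single-model, two-role loop described above. It assumes a generic text-generation helper (`generate`) and a search tool (`search_tool`); both are hypothetical placeholders, not the paper's actual code, and the real MATPO rollout logic is more involved.

```python
# Minimal sketch of a planner/worker loop served by ONE model instance.
# `generate` and `search_tool` are hypothetical stubs to be replaced with
# a real chat-completion client and a real tool backend.

PLANNER_SYSTEM = (
    "You are the planner. Break the user's question into sub-tasks and "
    "emit one instruction per line for a worker agent."
)
WORKER_SYSTEM = (
    "You are the worker. Use the tool output you are given to complete "
    "the sub-task and return a concise summary of the evidence."
)


def generate(system_prompt: str, user_prompt: str) -> str:
    """Single LLM instance: the system prompt alone selects the role."""
    raise NotImplementedError("plug in your model client here")


def search_tool(query: str) -> str:
    """Noisy external tool, e.g. a web-search API."""
    raise NotImplementedError("plug in your tool backend here")


def solve(question: str, max_subtasks: int = 3) -> str:
    # 1. Planner role: decompose the question into sub-tasks.
    plan = generate(PLANNER_SYSTEM, question)
    subtasks = [line for line in plan.splitlines() if line.strip()][:max_subtasks]

    # 2. Worker role: each sub-task runs in its own short context, so noisy
    #    tool output never bloats the planner's context window.
    findings = []
    for task in subtasks:
        evidence = search_tool(task)
        findings.append(
            generate(WORKER_SYSTEM, f"Sub-task: {task}\nTool output: {evidence}")
        )

    # 3. Planner role again: compose the final answer from worker summaries.
    return generate(PLANNER_SYSTEM, f"Question: {question}\nFindings: {findings}")
```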

Why it matters?

This work is important because it shows a practical way to build more powerful and reliable AI systems that can handle complex tasks. By using a single language model for multiple roles and improving it through reinforcement learning, MATPO delivers a sizable performance boost (an average relative improvement of about 18% on the reported benchmarks) and makes these systems more robust to noisy tool outputs. It provides a pathway to AI that can better reason, plan, and interact with the real world.

Abstract

Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. A natural solution is to adopt a multi-agent framework with planner- and worker-agents to manage context. However, no existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which enables distinct roles (planner and worker) to be trained within a single LLM instance using role-specific prompts via reinforcement learning. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need to deploy multiple LLMs, which would be memory-intensive, while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines by an average of 18.38% relative improvement in performance and exhibits greater robustness to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.
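The abstract's "principled credit assignment mechanism" is not spelled out in this summary, but one plausible reading, sketched below purely as an assumption, is a group-relative (GRPO-style) scheme in which the final task reward of each planner rollout is turned into an advantage and shared with the worker rollouts that planner spawned. The function names and data layout are illustrative only, not the paper's implementation.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Advantage of each rollout relative to its group: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def assign_credit(planner_rollouts):
    """
    planner_rollouts: list of dicts with keys
      'reward'          -- scalar task reward for the final answer
      'planner_tokens'  -- tokens generated while acting as the planner
      'worker_rollouts' -- list of token sequences generated as the worker
    Returns (token_sequence, advantage) pairs for one policy-gradient update.
    """
    advantages = group_relative_advantages([p["reward"] for p in planner_rollouts])
    training_batch = []
    for rollout, adv in zip(planner_rollouts, advantages):
        # The planner rollout and every worker rollout it spawned share the
        # same task-level advantage, so both roles are updated from the same
        # end-to-end outcome.
        training_batch.append((rollout["planner_tokens"], adv))
        for worker_tokens in rollout["worker_rollouts"]:
            training_batch.append((worker_tokens, adv))
    return training_batch


# Toy usage with dummy token ids:
batch = assign_credit([
    {"reward": 1.0, "planner_tokens": [1, 2, 3], "worker_rollouts": [[4, 5]]},
    {"reward": 0.0, "planner_tokens": [6, 7], "worker_rollouts": [[8, 9, 10]]},
])
```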