MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You
2025-03-05

Summary
This paper introduces MultiAgentBench, a new way to test how well AI language models can work together and compete in different situations, like a virtual obstacle course for AI teamwork and rivalry.
What's the problem?
Existing tests for AI language models usually focus on a single AI working alone or are limited to narrow domains. They don't reveal how well multiple AIs coordinate or compete in the more complex, real-world situations where several AIs need to interact.
What's the solution?
The researchers created MultiAgentBench, which puts AI language models through scenarios where they must either collaborate or compete. It measures not just whether they complete tasks, but how well they coordinate, using milestone-based metrics. The researchers also compared different ways for the agents to communicate and plan together, including a strategy called cognitive planning that improved how often the agents reached important milestones.
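As a rough illustration of what a milestone-based metric can look like, the sketch below computes the fraction of intermediate goals an agent team actually reached. The Milestone structure and function name are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Milestone:
    """A named intermediate goal the agent team should reach during a task (illustrative)."""
    name: str
    achieved: bool

def milestone_achievement_rate(milestones: list[Milestone]) -> float:
    """Fraction of milestones the team reached; a hypothetical sketch of a milestone-based KPI."""
    if not milestones:
        return 0.0
    return sum(m.achieved for m in milestones) / len(milestones)

# Example: 3 of 4 milestones reached -> 0.75
run = [
    Milestone("form_plan", True),
    Milestone("assign_roles", True),
    Milestone("draft_solution", True),
    Milestone("final_review", False),
]
print(milestone_achievement_rate(run))  # 0.75
```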
Why it matters?
This matters because as AI becomes more advanced, we need to understand how multiple AIs can work together or compete effectively. MultiAgentBench helps researchers improve AI teamwork and decision-making in complex situations, which could lead to better AI assistants, more efficient problem-solving in fields like science or business, and even help us understand how to make AIs that can cooperate safely and effectively with humans.
Abstract
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
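One way to picture the coordination protocols mentioned in the abstract is as communication graphs over agents, where an edge means two agents may exchange messages. The sketch below is illustrative only; the function names and agent roles are assumptions and do not reflect the MARBLE codebase.

```python
# Illustrative only: coordination topologies as adjacency maps over agent names.

def star(agents):
    """Hub-and-spoke: the first agent talks to everyone else."""
    hub, *rest = agents
    return {hub: set(rest), **{a: {hub} for a in rest}}

def chain(agents):
    """Each agent talks only to its immediate neighbors in a line."""
    edges = {a: set() for a in agents}
    for a, b in zip(agents, agents[1:]):
        edges[a].add(b)
        edges[b].add(a)
    return edges

agents = ["planner", "coder", "reviewer", "tester"]  # hypothetical roles
print(star(agents))   # planner connected to all others
print(chain(agents))  # planner-coder-reviewer-tester line
```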