The Collaboration Gap

Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz, Ece Kamar

2025-11-05

Summary

This paper investigates how well different AI programs can work together, specifically when they each have different abilities and can't see everything that's going on. It highlights a surprising problem: AI that's good at tasks on its own often struggles when it needs to collaborate with another AI.

What's the problem?

As AI gets more advanced, we're moving towards systems where many different AI 'agents' work together to solve problems. However, there hasn't been much research actually *testing* how well these agents collaborate, especially when they don't have complete information about the problem or each other. The core issue is that strong individual performance doesn't automatically translate into good teamwork, and we don't really understand why that is or how to fix it.

What's the solution?

The researchers created a maze-solving challenge specifically designed to test AI collaboration. They tested 32 different AI models, both open- and closed-source, in three settings: solving mazes alone, working with a copy of the same model, and working with a different model. They found that collaboration often led to worse results than individual performance. To improve things, they tried a "relay inference" approach, in which the stronger AI takes the lead initially and then hands the problem off to the weaker AI; this significantly improved the overall success rate.
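The relay idea can be sketched in a few lines. The toy below is purely illustrative (the agent functions, the one-dimensional "maze", and the `relay_inference` interface are this summary's inventions, not the paper's actual protocol): a strong agent leads for a fixed number of turns, then a weaker agent, which can only imitate moves it has seen in the transcript, takes over.

```python
# Hypothetical sketch of "relay inference": a stronger agent leads the
# interaction for the first few turns, then hands off to a weaker agent.
# All names and the toy "maze" (walk from position 0 to GOAL on a line)
# are illustrative assumptions, not the paper's implementation.

GOAL = 5

def strong_agent(pos, transcript):
    # The strong agent knows the task: step toward the goal.
    step = 1 if pos < GOAL else 0
    pos += step
    return f"strong:+{step}", pos, pos == GOAL

def weak_agent(pos, transcript):
    # The weak agent can only copy the direction of the last recorded move,
    # so it stalls if it has to start the interaction from scratch.
    step = 1 if transcript and transcript[-1].endswith("+1") else 0
    pos += step
    return f"weak:+{step}", pos, pos == GOAL

def relay_inference(strong, weak, pos, lead_turns, max_turns):
    """Strong agent acts for `lead_turns` turns, then the weak agent continues."""
    transcript = []
    for turn in range(max_turns):
        agent = strong if turn < lead_turns else weak
        move, pos, done = agent(pos, transcript)
        transcript.append(move)
        if done:
            break
    return pos, transcript

# With a short strong-led prefix, the pair reaches the goal;
# with the weak agent leading from the start, it never moves.
relayed, _ = relay_inference(strong_agent, weak_agent, 0, lead_turns=2, max_turns=10)
weak_only, _ = relay_inference(strong_agent, weak_agent, 0, lead_turns=0, max_turns=10)
print(relayed, weak_only)  # → 5 0
```

The toy mirrors the paper's qualitative finding: the same weak agent that fails when it must open the interaction succeeds once a stronger partner establishes the pattern first.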

Why it matters?

This research is important because it shows that building AI systems that rely on teamwork isn't as simple as just combining individually capable AI programs. It points to the need for new ways to evaluate AI, specifically focusing on collaborative skills, and for new training methods that teach AI how to work effectively with others. These findings apply not only to AI-to-AI collaboration but also to how we design AI systems to work *with* humans.

Abstract

The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a "collaboration gap": models that perform well solo often degrade substantially when required to collaborate. Collaboration can break down dramatically; for instance, small distilled models that solve mazes well alone may fail almost completely in certain pairings. We find that starting with the stronger agent often improves outcomes, motivating a "relay inference" approach where the stronger agent leads before handing off to the weaker one, closing much of the gap. Our findings argue for (1) collaboration-aware evaluation, (2) training strategies developed to enhance collaborative capabilities, and (3) interaction design that reliably elicits agents' latent skills, guidance that applies to AI-AI and human-AI collaboration.