Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, Aishwarya Agrawal

2026-04-06


Summary

This research investigates whether Multimodal Large Language Models (MLLMs), AI models that can process both images and text, can collaborate to build a shared understanding of a space, as humans do when giving directions or describing a scene to each other.

What's the problem?

Currently, AI struggles with understanding space the way people do. When two people look at the same room from different angles, they can talk to each other and work out where things are relative to one another, creating a shared mental map. This paper asks whether AI can do the same thing: take different viewpoints and, through conversation, build a complete and accurate understanding of a shared environment. Existing models fall short here, particularly on reasoning about spatial relationships and on building a full 'map' of the area.

What's the solution?

The researchers created a new benchmark called COSMIC, in which two AI agents 'look' at a 3D indoor environment from different spots and communicate in natural language to answer questions about the space. They tested several MLLMs on this benchmark, measuring how well they could identify objects visible from both viewpoints, understand how things relate to each other, and ultimately build a consistent mental model of the environment. They also compared the AI's performance with that of humans on the same task, recording human conversations to see how people naturally build shared understanding.

Why does it matter?

This work is important because it highlights a key limitation of current AI: it cannot yet understand and reason about the physical world the way humans do. Improving this capability is crucial for building robots that can navigate and interact with our environment effectively, and for creating AI assistants that can help with tasks involving spatial reasoning, like interior design or giving directions. The research shows there is still a significant gap between AI and human performance in collaborative spatial understanding (72% versus 95% aggregate accuracy for the best model and humans, respectively), and points to areas where AI needs to improve.

Abstract

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance even for the frontier models. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement for even the best-performing model, Gemini-3-Pro-Thinking, which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
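
The two-agent setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the COSMIC code: the function names, the "ANSWER:" convention, the turn limit, and the use of short text strings in place of images are all assumptions made for illustration, and the two scripted agents stand in for real MLLMs.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical types: a message is (speaker, text); an agent maps its own
# private view plus the dialogue so far to its next message.
Message = Tuple[str, str]
Agent = Callable[[str, List[Message]], str]

ANSWER_PREFIX = "ANSWER:"  # assumed convention for committing to an answer


def run_dialogue(agent_a: Agent, agent_b: Agent, view_a: str, view_b: str,
                 question: str, max_turns: int = 6) -> List[Message]:
    """Alternate turns between two agents that each see only their own view."""
    history: List[Message] = [("question", question)]
    speakers = [("A", agent_a, view_a), ("B", agent_b, view_b)]
    for turn in range(max_turns):
        name, agent, view = speakers[turn % 2]
        text = agent(view, history)  # the partner's view is never passed in
        history.append((name, text))
        if text.startswith(ANSWER_PREFIX):
            break  # stop once an agent commits to an answer
    return history


def final_answer(history: Sequence[Message]) -> Optional[str]:
    """Return the most recent committed answer, or None if there is none."""
    for _, text in reversed(history):
        if text.startswith(ANSWER_PREFIX):
            return text[len(ANSWER_PREFIX):].strip()
    return None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Scripted stand-ins for MLLM agents (toy rules, for illustration only).
def describer(view: str, history: List[Message]) -> str:
    return f"From my side: {view}"


def integrator(view: str, history: List[Message]) -> str:
    partner_text = history[-1][1]
    # The sofa is the shared anchor object linking the two partial views.
    if "lamp left of sofa" in partner_text and view == "sofa left of table":
        return f"{ANSWER_PREFIX} yes"
    return f"{ANSWER_PREFIX} unknown"


history = run_dialogue(describer, integrator,
                       view_a="lamp left of sofa",
                       view_b="sofa left of table",
                       question="Is the lamp left of the table?")
print(final_answer(history))
```

In this toy case neither agent can answer alone; the answer only follows once the object seen from both sides is used to link the two partial descriptions, which is the kind of integration the benchmark is described as probing.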
