COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu
2025-12-08
Summary
This paper focuses on improving how well Multimodal Large Language Models, AI systems that understand both images and text, can reason about 3D space and the relationships between objects within it.
What's the problem?
Current AI models struggle to truly understand 3D scenes. Existing attempts to fix this usually focus on either improving how the model *sees* the image, by adding extra information like depth maps or object outlines, or improving the model's *reasoning* abilities, for example by training it to answer questions about spatial arrangements. These approaches treat perception and reasoning as separate issues and don't let the model learn how the two connect.
What's the solution?
The researchers created a new model called COOPER. COOPER is designed to learn both how to generate helpful visual information (like depth maps and object outlines) *and* how to use that information to reason about space in an ongoing, interleaved back-and-forth process. It's trained in two stages: first to generate the extra visual information, and then to decide when and how to use that information to improve its spatial reasoning. This lets the model strengthen its understanding of space by actively generating and consulting additional visual cues.
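The two-stage idea above can be sketched in code. This is a purely illustrative toy, not the paper's implementation: the class and method names (`UnifiedMLLM`, `train_auxiliary_generation`, `train_interleaved_reasoning`) are invented, and the real model would involve actual training losses and image tensors rather than a skill set.

```python
# Hypothetical sketch of COOPER's two-stage training flow.
# All names here are invented for illustration; the paper does not
# specify this API. Real training would optimize model weights,
# not toggle entries in a set.

from dataclasses import dataclass, field


@dataclass
class UnifiedMLLM:
    """Toy stand-in for a unified multimodal model."""
    skills: set = field(default_factory=set)

    def train_auxiliary_generation(self, rgb_batch):
        # Stage 1: learn to generate auxiliary modalities (depth maps,
        # segmentation masks) from RGB inputs.
        self.skills.update({"depth", "segmentation"})

    def train_interleaved_reasoning(self, spatial_vqa_batch):
        # Stage 2: learn to decide *when* to emit an auxiliary modality
        # mid-reasoning and condition later reasoning steps on it.
        assert {"depth", "segmentation"} <= self.skills, "run stage 1 first"
        self.skills.add("interleaved_reasoning")

    def answer(self, image, question):
        # Return the sequence of reasoning steps the model would take.
        steps = ["reason over RGB"]
        if "interleaved_reasoning" in self.skills:
            # Adaptively interleave perception and reasoning.
            steps += ["generate depth/segmentation", "reason over aux cues"]
        steps.append("final answer")
        return steps


model = UnifiedMLLM()
model.train_auxiliary_generation(rgb_batch=[])
model.train_interleaved_reasoning(spatial_vqa_batch=[])
print(model.answer(image=None, question="Which object is closer?"))
```

The key design point the sketch tries to convey is that a single model holds both capabilities, so the interleaved reasoning in stage 2 can build directly on the perception skills acquired in stage 1.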
Why it matters?
This work is important because it shows that a single AI model can learn to both perceive spatial information more effectively and reason about it more accurately. The fact that even just learning to *create* the extra visual information improved spatial understanding suggests that the model is developing a deeper, more internal understanding of 3D space, which could lead to more capable and reliable AI systems.
Abstract
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.