SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
2025-05-23
Summary
This paper introduces SpatialScore, a benchmark for testing how well multimodal large language models, which process text together with images or video, understand 3D space and spatial relationships. It also presents SpatialAgent, a system that equips these models with specialized tools to improve their spatial reasoning.
What's the problem?
Multimodal large language models (the kind that work with both text and images) are good at general question answering, but they struggle to understand or reason about 3D space, such as figuring out where things are in a room or how objects move and relate to each other. Before this work, there was no single, unified way to measure how well these models handle spatial tasks, and existing tests each covered only a narrow slice of them.
What's the solution?
The authors created SpatialScore, a large benchmark of over 28,000 questions drawn from many different datasets, covering spatial tasks such as object localization, depth estimation, and camera motion, among others. They also curated a more difficult subset, SpatialScore-Hard, to stress-test the models. To help models do better, they built SpatialAgent, which works like an assistant that calls on nine specialized tools to break down spatial problems and solve them step by step. They evaluated many popular models and showed that SpatialAgent's tool-based approach helps models perform much better on the hardest spatial tasks.
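To make the idea of solving a spatial question step by step with tools more concrete, here is a minimal Python sketch. It is not SpatialAgent's implementation: the tool names (`estimate_depth`, `compare_depth`), the `run_plan` helper, and the toy depth values are all hypothetical, chosen only to illustrate how a plan of tool calls can pass intermediate results forward through a shared state.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool type: each tool reads the shared state and returns new fields.
# These are illustrative stand-ins, not the nine tools used by SpatialAgent.
Tool = Callable[[dict], dict]


def estimate_depth(state: dict) -> dict:
    # Stand-in for a depth estimator: copies pre-supplied toy depths (metres).
    return {"depth": dict(state["toy_depths"])}


def compare_depth(state: dict) -> dict:
    # Picks the object with the smallest estimated depth.
    depth = state["depth"]
    return {"answer": min(depth, key=depth.get)}


TOOLS: Dict[str, Tool] = {
    "estimate_depth": estimate_depth,
    "compare_depth": compare_depth,
}


def run_plan(plan: List[str], state: dict) -> Tuple[dict, List[str]]:
    """Run tools in order, merging each tool's output into the shared state."""
    trace: List[str] = []
    for name in plan:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        state = {**state, **TOOLS[name](state)}
        trace.append(name)
    return state, trace


question = {
    "question": "Which object is closer to the camera?",
    "toy_depths": {"chair": 2.4, "lamp": 3.1},  # made-up values for illustration
}
final_state, trace = run_plan(["estimate_depth", "compare_depth"], question)
print(final_state["answer"])  # chair
```

In this sketch the plan is fixed by hand. In a tool-augmented system of the kind described above, the model itself would decide which tool to call next based on the question and the results gathered so far.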
Why does it matter?
As AI is used in more real-world settings, such as robotics, self-driving cars, and virtual reality, it is crucial that these models truly understand 3D space rather than just answer simple questions. SpatialScore gives researchers a clear, demanding way to measure progress, while SpatialAgent shows a promising path toward making AI better at reasoning about the physical world. This could lead to safer, more reliable, and more capable AI systems in the future.
Abstract
SpatialScore benchmarks multimodal large language models for 3D spatial understanding, revealing challenges and showcasing the effectiveness of SpatialAgent with specialized tools.