3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen
2024-12-12

Summary
This paper introduces 3DSRBench, a new benchmark designed to evaluate how well AI models can understand and reason about 3D space, a capability that is important for tasks like navigation and robotics.
What's the problem?
AI models have made great progress in understanding images and videos, but they often struggle with 3D spatial reasoning. This means they have difficulty analyzing the positions and relationships of objects in three-dimensional space. Without proper evaluation tools, it's hard to know how well these models perform in real-world scenarios where understanding 3D space is crucial.
What's the solution?
To address this issue, the authors created 3DSRBench, a benchmark of 2,772 manually annotated visual question-answer pairs spanning 12 question types that test various aspects of 3D reasoning. The benchmark also includes paired images taken from common and uncommon camera viewpoints. By testing a range of AI models on it, the authors were able to identify their strengths and weaknesses in understanding 3D scenes.
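The abstract mentions a "FlipEval" strategy for robust evaluation. The paper does not spell out the protocol here, but one plausible reading is a consistency check: ask the same left/right question on an image and its horizontal mirror, and expect mirrored answers from a 3D-aware model. The sketch below illustrates that idea; all function names and the sample format are hypothetical, not the paper's actual implementation.

```python
# Hypothetical FlipEval-style consistency check (a sketch, not the paper's code).
# A lateral answer ("left"/"right") should swap when the image is mirrored
# horizontally; other answers (e.g. "in front", "behind") should be unchanged.

FLIP_ANSWER = {"left": "right", "right": "left"}


def flip_answer(answer: str) -> str:
    """Expected answer on the horizontally flipped image."""
    return FLIP_ANSWER.get(answer, answer)


def flip_consistent(answer_original: str, answer_flipped: str) -> bool:
    """True if the model's two answers are consistent under the flip."""
    return flip_answer(answer_original) == answer_flipped


def evaluate(model, samples):
    """Score a model on (image, flipped_image, question, gold) tuples.

    A sample counts as correct only if the answer on the original image
    matches the gold label AND the answer on the mirrored image is
    flip-consistent with it.
    """
    correct = 0
    for image, flipped_image, question, gold in samples:
        answer = model(image, question)
        answer_flipped = model(flipped_image, question)
        if answer == gold and flip_consistent(answer, answer_flipped):
            correct += 1
    return correct / len(samples)
```

The design choice here is that flip consistency filters out lucky guesses: a model that answers "left" by chance will usually fail to answer "right" on the mirrored image, so chance-level accuracy drops.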
Why it matters?
This research is important because it provides a comprehensive tool for evaluating AI models' abilities to reason about 3D spaces. By improving how AI understands spatial relationships, we can enhance applications in fields like robotics, autonomous vehicles, and augmented reality, making technology more effective and reliable in navigating the real world.
Abstract
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. Our 3DSRBench provides valuable findings and insights about the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset are available at https://3dsrbench.github.io.