Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

2025-09-12

Summary

This paper addresses the difficulty current Vision-Language Models (VLMs) have in understanding 3D space, such as how objects relate to each other in a scene. The authors introduce a new benchmark, Ego3D-Bench, to measure this ability, and a method, Ego3D-VLM, to help these models improve.

What's the problem?

Existing VLMs struggle to understand spatial relationships, especially when viewing the world from a first-person perspective, as a robot or driver would when seeing multiple views of their surroundings. Previous tests used single images or indoor videos, which don't reflect how robots and self-driving cars actually perceive the world. As a result, the models fall short of human-level performance when asked questions about 3D space.

What's the solution?

The researchers created Ego3D-Bench, a large collection of over 8,600 questions and answers about 3D scenes captured from a first-person, multi-view viewpoint. They then developed Ego3D-VLM, which helps VLMs build a kind of 'mental map' of the environment by estimating the global 3D locations of objects. This 'map' allows the VLM to answer spatial questions more accurately, improving performance by 12% on average on multiple-choice QA and by 56% on average on absolute distance estimation. Importantly, Ego3D-VLM is modular and can be added to existing VLMs without major changes.
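To make the 'mental map' idea concrete, here is a minimal sketch of how estimated 3D object coordinates could be rendered as a textual cognitive map and queried for distances. This is purely illustrative: the summary does not specify the paper's actual map format, so the function names, the coordinate dictionary, and the text layout below are assumptions.

```python
from typing import Dict, Tuple

# Hypothetical mapping from object name to estimated global (x, y, z)
# coordinates in meters, e.g. produced by an upstream 3D-estimation stage.
Coords = Dict[str, Tuple[float, float, float]]

def build_cognitive_map(objects: Coords) -> str:
    """Render estimated 3D positions as a text 'cognitive map' that could
    be prepended to a VLM prompt. Illustrative sketch only."""
    lines = ["Cognitive map (global coordinates, meters):"]
    for name, (x, y, z) in sorted(objects.items()):
        lines.append(f"- {name}: x={x:.1f}, y={y:.1f}, z={z:.1f}")
    return "\n".join(lines)

def distance(objects: Coords, a: str, b: str) -> float:
    """Euclidean distance between two mapped objects -- the kind of
    quantity absolute-distance QA asks about."""
    ax, ay, az = objects[a]
    bx, by, bz = objects[b]
    return ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2) ** 0.5

scene = {"car": (3.0, 0.0, 0.0), "pedestrian": (0.0, 4.0, 0.0)}
print(build_cognitive_map(scene))
print(f"car-pedestrian distance: {distance(scene, 'car', 'pedestrian'):.1f} m")
```

The intuition is that once object positions are written out explicitly, spatial questions reduce to reasoning over a small structured text block rather than raw pixels from multiple camera views.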

Why it matters?

This work is important because it provides a more realistic way to test and improve the spatial reasoning abilities of VLMs. Better spatial understanding is crucial for building AI agents, like robots and self-driving cars, that can safely and effectively navigate and interact with the real world. The new benchmark and improvement method offer tools to push these models closer to human-level performance.

Abstract

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.