SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue
2025-10-13
Summary
This research focuses on improving how computers understand the spatial relationships within scenes, like understanding where things are in a room or on a street. It aims to make these systems better at tasks like robotics and self-driving cars, which require a strong understanding of space.
What's the problem?
Currently, teaching computers to understand space is difficult because it relies heavily on detailed 3D scans of indoor areas and a lot of manual labeling of objects. Also, existing methods often struggle to understand scenes at *all* different scales, from recognizing a small object on a table to understanding the layout of an entire city. This leads to systems that work well in one specific environment but fail when the setting changes or is viewed from a different distance.
What's the solution?
The researchers created a new dataset called SpaceVista-1M, which contains roughly one million question-answer pairs about spatial relationships, spanning 19 task types across more than 38,000 video scenes at 5 different scales, from millimeters to kilometers. They built this dataset using a semi-automated, specialist-driven pipeline to reduce the need for manual labeling. They also developed a new model, SpaceVista-7B, that is designed to handle information at different scales and learn progressively, avoiding conflicts between different types of spatial knowledge. The model uses scale as an anchor to select scale-aware experts when interpreting a scene.
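The "scale as an anchor" idea can be sketched as a simple routing step: estimate the physical extent of the scene, then dispatch the query to an expert specialized for that scale band. The bucket names and thresholds below are illustrative assumptions for this sketch, not the paper's actual partition or implementation.

```python
# Hypothetical sketch of scale-anchored expert routing (assumed names and
# thresholds; not the authors' actual SpaceVista-7B code).

SCALE_BUCKETS = [
    # (expert name, lower bound in meters, upper bound in meters)
    ("object",   1e-3, 1e-1),          # mm–cm: small object parts
    ("tabletop", 1e-1, 1.0),           # cm–m: desktop scenes
    ("indoor",   1.0,  1e2),           # room-level scenes
    ("outdoor",  1e2,  1e3),           # street-level scenes
    ("aerial",   1e3,  float("inf")),  # km-scale scenes
]

def route_expert(scene_extent_m: float) -> str:
    """Pick the expert whose scale bucket contains the scene extent (meters)."""
    for name, lo, hi in SCALE_BUCKETS:
        if lo <= scene_extent_m < hi:
            return name
    return "object"  # below all buckets: fall back to the smallest scale

print(route_expert(0.05))   # a 5 cm object  -> "object"
print(route_expert(250.0))  # a city block   -> "outdoor"
```

In practice the scene extent would itself be predicted (e.g. from depth or geometry cues) rather than given, and the "experts" would be learned modules rather than string labels; this sketch only shows how a scale anchor can disambiguate otherwise similar-looking inputs.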
Why it matters?
This work is important because it provides a large, diverse dataset and a new model that can significantly improve a computer’s ability to understand spatial relationships. This advancement is crucial for building more reliable and adaptable robots and self-driving cars, and for other applications that require a strong understanding of the physical world. The dataset and model are being released publicly to help other researchers build on this work.
Abstract
With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We therefore build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.