ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, Angel X. Chang
2026-04-28
Summary
This paper examines flaws in how we currently test whether artificial intelligence systems, specifically vision-language models (VLMs), understand spaces and objects in videos. It introduces a new, more reliable way to evaluate these models' 'spatial intelligence'.
What's the problem?
Existing tests for spatial intelligence in AI are flawed because they often reuse data originally created for other purposes, such as analyzing 3D scans. When applied to videos, this data can be inaccurate: objects clearly visible in the video may be missing from the annotations, or labeled incorrectly. In addition, many tests assume the AI can see the entire scene at once, but these models often process only a few sampled frames from the video, making some questions impossible to answer from the information they actually receive.
What's the solution?
The researchers created a new benchmark called ReVSI. They carefully re-annotated objects and their shapes in videos from several datasets, ensuring the data is accurate and reflects what is actually visible. They also rewrote the questions so that each one can be answered using only the limited information the AI actually receives (such as a handful of sampled frames), as sketched below. ReVSI also allows testing with different amounts of video data to see how that affects performance.
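To make the "answerable under the model's actual inputs" idea concrete, here is a minimal sketch of how one might check a question against a sparse frame budget using per-frame visibility information. This is not the authors' released code; field names such as referenced_object_ids and the per-frame visibility sets are illustrative assumptions, not ReVSI's actual schema.

```python
# Sketch (hypothetical, not ReVSI's published code): a QA pair only counts as
# answerable if every object it references is visible in at least one of the
# frames the model actually sees under a given frame budget.

def uniform_frame_indices(num_frames_total: int, budget: int) -> list[int]:
    """Pick `budget` evenly spaced frame indices from a video."""
    if budget >= num_frames_total:
        return list(range(num_frames_total))
    step = num_frames_total / budget
    return [int(i * step) for i in range(budget)]

def is_answerable(qa: dict, per_frame_visibility: list[set[str]], budget: int) -> bool:
    """Check whether all objects the question refers to appear in the sampled frames."""
    sampled = uniform_frame_indices(len(per_frame_visibility), budget)
    seen: set[str] = set()
    for i in sampled:
        seen |= per_frame_visibility[i]  # objects visible in this sampled frame
    return set(qa["referenced_object_ids"]) <= seen
```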
Why it matters?
This work is important because it provides a more accurate way to measure how well AI understands spatial relationships and objects in videos. By revealing weaknesses in current models that were hidden by flawed testing methods, it helps researchers build better and more reliable AI systems that can truly 'see' and understand the world around them.
Abstract
Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.
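The abstract's frame-budget variants (16/32/64/all) and visibility metadata suggest a budget-conditioned evaluation loop. The sketch below illustrates one plausible way to run such a controlled comparison; the loader, the answer function, and the "answerable_at" metadata field are hypothetical placeholders, not ReVSI's published API.

```python
# Hedged sketch of a budget-conditioned evaluation: score a model at
# 16/32/64/all sampled frames, counting only QA pairs marked answerable
# at that budget. All helper names here are assumptions for illustration.
from typing import Callable

FRAME_BUDGETS = [16, 32, 64, None]  # None = use all frames

def evaluate_by_budget(
    qa_items: list[dict],
    load_frames: Callable[[str, int | None], list],  # (scene_id, budget) -> frames
    answer_fn: Callable[[list, str], str],           # (frames, question) -> prediction
) -> dict:
    """Return accuracy per frame budget."""
    results = {}
    for budget in FRAME_BUDGETS:
        correct = total = 0
        for qa in qa_items:
            # Skip questions the benchmark marks as unanswerable at this budget.
            if not qa.get("answerable_at", {}).get(str(budget), True):
                continue
            frames = load_frames(qa["scene_id"], budget)
            pred = answer_fn(frames, qa["question"])
            correct += int(pred.strip().lower() == qa["answer"].strip().lower())
            total += 1
        results[budget] = correct / max(total, 1)
    return results
```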