SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao

2025-09-12

Summary

This paper introduces a new dataset called SpatialVID, designed to help computers better understand and interact with the 3D world from videos.

What's the problem?

Currently, teaching computers to understand space – like how objects move and where things are in a scene – is held back by a lack of good training data. Existing datasets are often too small, don't show enough variety in real-world situations, or lack detailed information about camera movement and the 3D structure of scenes, especially when things are moving around. It's hard for models to learn to 'see' the world accurately without enough realistic examples.

What's the solution?

The researchers created SpatialVID by collecting over 21,000 hours of real-world videos and filtering them through a hierarchical pipeline into about 2.7 million clips, totaling 7,089 hours of dynamic content. Then, they added a lot of detailed information to these clips, including the camera's position and angle in each frame, depth information (how far away things are), outlines of moving objects, descriptions of what's happening, and even instructions on how the camera is moving. This creates a rich dataset for training spatial intelligence models.
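
To make the annotation layout concrete, here is a minimal sketch of how a single annotated clip could be represented in code. The SpatialVIDClip class and its field names are hypothetical illustrations of the annotation types listed above, not the dataset's actual release format.

from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class SpatialVIDClip:
    """Hypothetical per-clip record covering the annotation types described above."""
    clip_id: str
    frames: np.ndarray             # (T, H, W, 3) RGB frames of the clip
    camera_poses: np.ndarray       # (T, 4, 4) per-frame camera-to-world transforms
    depth_maps: np.ndarray         # (T, H, W) per-frame depth estimates
    dynamic_masks: np.ndarray      # (T, H, W) boolean masks of moving objects
    caption: str                   # structured description of what is happening
    motion_instructions: List[str] = field(default_factory=list)  # serialized camera-motion commands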

Why it matters?

SpatialVID is important because it provides a much larger and more detailed dataset than what was previously available. This allows researchers to build computer models that can better understand and navigate the real world, improving things like robotics, self-driving cars, and virtual reality. The dataset’s richness and diversity should help models generalize better to new, unseen situations.

Abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video and process it into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
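
As a rough illustration of the hierarchical filtering idea mentioned in the abstract, the sketch below chains per-clip checks from cheap to expensive, keeping only clips that pass every stage. The stage names and criteria (is_sharp, has_camera_motion, is_dynamic_scene) are assumptions made for illustration, not the authors' actual pipeline.

from typing import Callable, Iterable, List

# Hypothetical per-clip checks; the real pipeline's criteria are not specified here.
def is_sharp(clip) -> bool:
    return True  # placeholder: e.g. reject blurry or low-resolution clips

def has_camera_motion(clip) -> bool:
    return True  # placeholder: e.g. reject static-camera clips

def is_dynamic_scene(clip) -> bool:
    return True  # placeholder: e.g. keep clips with moving objects

def hierarchical_filter(clips: Iterable, stages: List[Callable]) -> List:
    """Apply filtering stages in order; each stage sees only the survivors of the previous one."""
    surviving = list(clips)
    for stage in stages:
        surviving = [c for c in surviving if stage(c)]
    return surviving

# Usage sketch: run cheap checks first so the expensive ones see fewer clips.
# kept_clips = hierarchical_filter(raw_clips, [is_sharp, has_camera_motion, is_dynamic_scene])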