Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan

2025-12-03

Summary

This research explores whether AI models that generate videos can also demonstrate 'visuospatial intelligence': the human-like ability to understand and reason about space and the objects in it. In particular, it asks whether such models can acquire this ability simply by watching videos.

What's the problem?

Current AI models often need extra information, like depth maps or precise camera positions, to understand spatial relationships in videos. This research tackles the problem of whether a model can develop a strong understanding of space and objects *only* from the visual information present in a video, without any additional cues. Essentially, can an AI 'watch' a video and figure out how to navigate a space or find objects within it, similar to how a person would?

What's the solution?

The researchers created a system called Video4Spatial. This system uses a type of AI called a 'video diffusion model' and trains it to perform two tasks using only video footage: navigating through a scene based on instructions, and identifying specific objects within the video. They carefully designed the system and selected the training videos to help the AI learn these skills effectively. The model learns to plan routes and locate objects directly from the video, without needing any extra data about the environment.
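To make the idea of "conditioning only on video context" concrete, here is a minimal, purely illustrative sketch of a diffusion-style training step. All names, shapes, and the toy denoiser are hypothetical assumptions, not the paper's actual architecture; a real video diffusion model would attend over the context frames with a learned network rather than use the stand-in function below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy shapes: (frames, height, width, channels).
context = rng.normal(size=(8, 16, 16, 3))  # video-only scene context (no depth, no poses)
target = rng.normal(size=(8, 16, 16, 3))   # clip the model must learn to generate

def add_noise(x, t):
    """Forward diffusion: blend clean frames with Gaussian noise at level t in [0, 1]."""
    noise = rng.normal(size=x.shape)
    return np.sqrt(1 - t) * x + np.sqrt(t) * noise, noise

def toy_denoiser(noisy, context, t):
    """Stand-in for the video diffusion model: predicts the added noise.
    Subtracting a context-derived term shows *where* the video context
    enters the prediction; it is not a real model."""
    return noisy - context.mean() * (1 - t)

t = 0.5
noisy, noise = add_noise(target, t)
pred = toy_denoiser(noisy, context, t)
loss = float(np.mean((pred - noise) ** 2))  # standard noise-prediction objective
```

The key point the sketch mirrors is that the denoiser receives only pixels, the noisy target clip plus context frames from the same scene, so any spatial reasoning must be learned from visual data alone.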

Why it matters?

This work is important because it moves AI closer to having a more general understanding of the world. If AI can understand spatial relationships just from watching videos, it opens up possibilities for more realistic and helpful applications, like robots that can navigate homes based on visual input or AI assistants that can understand and respond to instructions about objects in a video. It suggests that video alone contains enough information for AI to develop a surprisingly sophisticated understanding of its surroundings.

Abstract

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate it on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the 3D geometry of the scene, and object grounding, which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.