
How Much 3D Do Video Foundation Models Encode?

Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg

2025-12-26


Summary

This research investigates whether video models, after being shown a lot of videos, naturally develop an understanding of the 3D world those videos represent.

What's the problem?

We watch the world in 2D on screens, but the world itself is 3D. The question is: if you train a model on tons of videos, will it automatically learn about depth and 3D space, even though it only ever sees flat images? Existing video models haven't been thoroughly tested to see *how much* 3D understanding they actually have.

What's the solution?

The researchers created a model-agnostic method to measure different video models' 3D awareness. They didn't change or retrain the models themselves; instead, they looked at the information *inside* the models (how they represent what they 'see') and trained shallow read-outs, small probe networks, on top of those frozen features to estimate 3D properties like object shape and scene layout. They applied this test to several popular video models, as the sketch below illustrates.
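The excerpt doesn't specify the read-out architecture, so here is a minimal sketch of the general probing recipe, assuming the frozen video model exposes patch-level features. The `video_model` interface, feature shapes, and the choice of a linear head are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class ShallowDepthReadout(nn.Module):
    """A deliberately small head trained on top of frozen video features.

    Because only this head is trained, its accuracy reflects what the
    pretrained features already encode about 3D, which is the quantity
    being probed. (Hypothetical sketch, not the paper's exact design.)
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)  # one depth value per feature token

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, feat_dim) patch features from the frozen model
        return self.head(feats).squeeze(-1)  # (batch, tokens)


def probe_step(video_model, readout, optimizer, clip, target_depth):
    """One training step: frozen backbone, trainable read-out only."""
    with torch.no_grad():              # the backbone is never updated
        feats = video_model(clip)      # assumed to return patch-level features
    pred = readout(feats)
    loss = nn.functional.l1_loss(pred, target_depth)
    optimizer.zero_grad()
    loss.backward()                    # gradients flow only into the read-out head
    optimizer.step()
    return loss.item()
```

Because the backbone receives no gradients, a high probe accuracy can only come from 3D information that was already present in the model's features after video pretraining.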

Why it matters?

It turns out that some of the best video *generation* models (the ones that can create videos) already have a surprisingly good grasp of 3D, even though they weren't specifically trained for 3D tasks. This is important because it suggests we might be able to build powerful 3D models more easily by starting with these existing video models, rather than building everything from scratch. It also gives us a way to measure and compare the 3D understanding of different video models.

Abstract

Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
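The abstract mentions benchmarking the 3D awareness of major VidFMs but this excerpt doesn't name the metrics used. A common way to score a depth read-out, shown here purely as an illustrative assumption, is the absolute relative error of its predictions against ground truth:

```python
import torch

def abs_rel_error(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Absolute relative depth error, a standard metric for depth probes.

    Lower is better; comparing this score across frozen backbones gives a
    model-agnostic ranking of how much depth their features encode.
    """
    mask = gt > 0                      # ignore pixels without valid ground truth
    return ((pred[mask] - gt[mask]).abs() / gt[mask]).mean().item()
```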