Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

2025-04-11

Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Summary

This paper talks about Geo4D, a new method that uses video-generating AI models to help computers build detailed 3D models of scenes that change over time, like people moving or cars driving. It takes regular videos and turns them into 4D reconstructions, meaning it captures both the 3D shape and how it changes as time passes.

What's the problem?

The main problem is that making accurate 3D models from regular videos, especially when things in the scene are moving, is really hard. Existing methods struggle with dynamic scenes and often need lots of real-world 3D data, which is difficult and expensive to collect. Most AI models also have trouble combining all the different pieces of information needed to get a complete, accurate 3D and time-based (4D) result.

What's the solution?

Geo4D solves this by taking advantage of video diffusion models, which are AI systems trained to generate realistic videos. These models already understand how things move and change in videos, so Geo4D uses them to predict important 3D details like points, depth, and rays for each frame. It then uses a special algorithm to align and combine all this information from different parts of the video, resulting in a strong and accurate 4D reconstruction, even when trained only on computer-generated data.

Why it matters?

This work matters because it makes it much easier and more reliable to create 3D models of real-life scenes just from videos, even when things are moving. This can help in areas like virtual reality, movie making, robotics, and scientific research, where understanding how things look and move in 3D is super important.

Abstract

We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.

View Paper