V-DPM: 4D Video Reconstruction with Dynamic Point Maps
Edgar Sucar, Eldar Insafutdinov, Zihang Lai, Andrea Vedaldi
2026-01-16
Summary
This paper introduces a new way to represent moving 3D scenes for computers, building on existing techniques that work well for static scenes and single pairs of images.
What's the problem?
Current methods for representing dynamic 3D scenes, called Dynamic Point Maps (DPMs), only work well with two images at a time and need an extra optimization step to handle more. They aren't designed for videos, where many frames show how a scene changes over time, so they don't fully capture the motion of everything in the scene.
What's the solution?
The researchers created a new system called V-DPM, which adapts Dynamic Point Maps to work with videos. They worked out how best to represent a video's changing 3D information, then took a powerful existing 3D reconstruction model, VGGT, and fine-tuned it with a modest amount of synthetic (computer-generated) data so it could handle moving scenes. The resulting system predicts both the 3D structure and the motion of a scene directly from a video.
Why does it matter?
This work matters because it lets computers build more accurate and complete 3D models of moving scenes from videos. Unlike other recent approaches, V-DPM doesn't just estimate *how far away* things are (depth); it also tracks *where* every single point in the scene is going, leading to better 3D and 4D reconstructions.
Abstract
Powerful 3D representations such as DUSt3R's invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend this concept to dynamic 3D content by additionally representing scene motion. However, existing DPMs are limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are more useful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to formulate DPMs for video input in a way that maximizes representational power, facilitates neural prediction, and enables reuse of pretrained models. Second, we implement these ideas on top of VGGT, a recent and powerful 3D reconstructor. Although VGGT was trained on static scenes, we show that a modest amount of synthetic data is sufficient to adapt it into an effective V-DPM predictor. Our approach achieves state-of-the-art performance in 3D and 4D reconstruction for dynamic scenes. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs recover not only dynamic depth but also the full 3D motion of every point in the scene.
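To make the core idea concrete: a point map assigns every pixel a 3D point, and a dynamic point map additionally gives that same point's 3D position at a second (reference) time, so full per-pixel 3D motion falls out by subtraction. Below is a minimal numpy sketch of this representation; all names, shapes, and values are illustrative assumptions, not the paper's actual interface or data.

```python
import numpy as np

# Illustrative sketch (not the paper's API): a dynamic point map stores,
# for every pixel, the surface point's 3D location at two times.
H, W = 4, 6  # tiny image for illustration

rng = np.random.default_rng(0)
# points_at_t[u, v]   : 3D location of the pixel's point at capture time t
points_at_t = rng.standard_normal((H, W, 3))
# points_at_ref[u, v] : the SAME point's 3D location at a shared reference
# time; here the whole scene is pretended to have shifted 0.1 along x.
points_at_ref = points_at_t - np.array([0.1, 0.0, 0.0])

# Full 3D scene flow for every pixel, which a depth-only representation
# cannot provide: just the per-pixel difference of the two maps.
scene_flow = points_at_t - points_at_ref  # shape (H, W, 3)

print(scene_flow.shape)  # (4, 6, 3)
```

Depth-only extensions of VGGT would output just the z-component per frame; carrying the paired 3D positions is what lets a DPM recover the full motion vector at every pixel.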