Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi SM Sajjadi

2025-12-10

Summary

This paper introduces a new computer vision model called D4RT that's designed to understand and recreate 3D scenes from videos, figuring out both the shape of objects and how they move over time.

What's the problem?

Understanding moving scenes in videos is really hard for computers. Existing methods struggle to efficiently determine the depth of objects, how different parts of the scene relate to each other across time, and the exact position and orientation of the camera that recorded the video. Many approaches are computationally expensive or require separate systems for each task, making them slow and complex.

What's the solution?

D4RT solves this with a single, streamlined transformer model. Instead of densely decoding the entire scene frame by frame, it uses a flexible querying mechanism: the model can directly ask for the 3D position of any point in the video at any moment in time, computing only what each query needs rather than reconstructing every frame in full. This makes it faster and more flexible than previous methods.

Why it matters?

This research is important because it significantly improves the speed and accuracy of 3D scene reconstruction from videos. By setting a new standard in performance, D4RT opens up possibilities for more realistic virtual reality, better robotics, and more advanced video analysis applications.

Abstract

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.