Continuous 3D Perception Model with Persistent State

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, Angjoo Kanazawa

2025-02-10

Summary

This paper introduces CUT3R, a new AI system that builds 3D models of scenes from a series of images or videos, even inferring parts of the scene it hasn't directly seen.

What's the problem?

Current 3D modeling systems often struggle to handle different types of input (like video streams or unordered collections of photos), and they have trouble producing complete 3D models when parts of the scene are never observed. They also typically can't update their reconstructions in real time as new images arrive.

What's the solution?

The researchers created CUT3R, which is built around a stateful recurrent model: a network that keeps a persistent internal "state" summarizing everything it has seen so far, and updates that state with each new image. From this state it predicts a 3D point map (a 3D point for every pixel) for each image, all in a shared coordinate system, so the maps accumulate into a full 3D reconstruction. CUT3R can even infer what unseen parts of the scene might look like, drawing on priors it has learned from many other scenes.
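To make the idea of a stateful recurrent model concrete, here is a minimal conceptual sketch of the online loop described above: a persistent state is updated with each incoming frame, and a pointmap is predicted for that frame. This is not the actual CUT3R code; names like `update_state` and `predict_pointmap` are illustrative placeholders, and the "pointmaps" here are toy stand-ins for the real per-pixel metric-scale 3D outputs.

```python
# Conceptual sketch of a stateful recurrent 3D perception loop
# (illustrative only -- not the CUT3R implementation).

from dataclasses import dataclass, field

@dataclass
class SceneState:
    """Persistent state summarizing all observations so far."""
    tokens: list = field(default_factory=list)  # stand-in for learned state tokens

def update_state(state: SceneState, image) -> SceneState:
    # In the paper this is a transformer that jointly updates the state
    # and the image representation; here we simply accumulate frames.
    state.tokens.append(image)
    return state

def predict_pointmap(state: SceneState, image):
    # Stand-in for predicting a per-pixel pointmap in a shared frame:
    # tag each "pixel" with the number of frames seen so far.
    return [(pixel, len(state.tokens)) for pixel in image]

def reconstruct(stream):
    """Online reconstruction: fold each new frame into the state,
    accumulating pointmaps into one growing scene."""
    state = SceneState()
    scene = []
    for image in stream:
        state = update_state(state, image)
        scene.extend(predict_pointmap(state, image))
    return scene

# Two toy "frames" with 2 and 1 "pixels" respectively.
points = reconstruct([[1, 2], [3]])
```

The key design point this sketch captures is that the reconstruction is online: each frame is processed once, the state carries everything needed forward, and the scene grows incrementally instead of being re-solved from scratch when a new image arrives.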

Why it matters?

This matters because it could make 3D modeling much easier and more accurate in many fields. For example, it could help create better virtual reality experiences, improve computer vision for self-driving cars, or assist in creating 3D models for movies or video games. The ability to work with different types of input and update in real-time makes it very flexible for various applications.

Abstract

We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each. Project Page: https://cut3r.github.io/