Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang

2025-11-14

Summary

This paper introduces Depth Anything 3 (DA3), a computer vision model that recovers consistent 3D geometry from any number of images, whether or not the camera poses are known. It aims to produce detailed and accurate depth maps from a wide range of visual inputs, from a single photo to large multi-view image sets.

What's the problem?

Existing methods for recovering 3D geometry from images often rely on specialized architectures and complicated multi-task training setups. They can also struggle to predict depth and camera positions accurately, especially when the images come from very different viewpoints or when the camera poses are unknown. Building a single, robust model that handles all of these cases efficiently was the core challenge.

What's the solution?

The researchers found that a surprisingly simple design, a standard 'transformer' network such as a plain DINO encoder, works well for this task without any architectural specialization. They also realized the model could be trained on a single depth-ray target, predicting a depth value and a camera ray for each pixel, rather than juggling many separate prediction tasks at once. A 'teacher-student' training approach, in which a teacher model guides the learning of the Depth Anything 3 model, lets it reach a high level of detail and generalization. Finally, they built a new benchmark to evaluate how well such models estimate camera poses, scene geometry, and rendered views.
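
To make the depth-ray idea more concrete, here is a minimal sketch of a prediction head that maps plain transformer tokens to per-patch depth values and ray directions. The module names, shapes, and PyTorch framing are illustrative assumptions, not DA3's actual implementation.

```python
# Minimal sketch of a single depth-ray prediction target on top of a plain
# transformer encoder. Names and shapes are illustrative assumptions, not
# DA3's published architecture.
import torch
import torch.nn as nn

class DepthRayHead(nn.Module):
    """Predicts, for each patch token, a depth value and a 3D ray direction."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.depth = nn.Linear(embed_dim, 1)   # one depth value per token
        self.ray = nn.Linear(embed_dim, 3)     # one ray direction per token

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_patches, embed_dim) from any plain ViT encoder
        depth = self.depth(tokens).squeeze(-1)                      # (batch, num_patches)
        rays = nn.functional.normalize(self.ray(tokens), dim=-1)    # unit ray directions
        return depth, rays

# Usage with dummy patch tokens standing in for a vanilla DINO backbone's output.
encoder_dim = 768
head = DepthRayHead(encoder_dim)
dummy_tokens = torch.randn(2, 196, encoder_dim)   # 2 images, 14x14 patches
depth, rays = head(dummy_tokens)
print(depth.shape, rays.shape)  # torch.Size([2, 196]) torch.Size([2, 196, 3])
```

The point of the sketch is only that a single head over generic transformer tokens can express the geometry target; no task-specific branches or specialized attention layers are required.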

Why it matters?

This work is important because it simplifies the process of creating 3D scene understanding models. By showing that a basic transformer can achieve state-of-the-art results, it opens the door for more efficient and accessible 3D vision applications. The improved accuracy in depth estimation and camera pose prediction has implications for things like robotics, augmented reality, and self-driving cars, where understanding the 3D world is crucial.

Abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
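
As a rough illustration of the teacher-student paradigm mentioned in the abstract, the sketch below shows one distillation step in which a frozen teacher's dense depth prediction serves as a pseudo-label for the student. The loss choice and loop structure are assumptions for illustration, not DA3's published training recipe.

```python
# Hypothetical teacher-student distillation step: the frozen teacher's dense
# depth output acts as a pseudo-label for the student. The L1 loss and loop
# structure are illustrative assumptions, not DA3's actual training recipe.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, optimizer):
    teacher.eval()
    with torch.no_grad():
        pseudo_depth = teacher(images)   # dense pseudo-labels from the teacher
    pred_depth = student(images)         # student predicts the same dense target
    loss = F.l1_loss(pred_depth, pseudo_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this framing, the student learns from dense predicted depth rather than from several hand-designed supervision signals, which matches the paper's emphasis on a single prediction target.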