AnyDepth: Depth Estimation Made Easy
Zeyu Ren, Zeyu Zhang, Wukai Li, Qingxiang Liu, Hao Tang
2026-01-12
Summary
This paper addresses monocular depth estimation: figuring out how far away the objects in a picture are using just a single image. The researchers developed a new system that does this more efficiently and accurately, even on types of images it hasn't been specifically trained on.
What's the problem?
Existing methods for estimating depth from single images often require huge amounts of training data and complex decoders, making them slow and hard to adapt to new situations. Systems like DPT are powerful, but they carry a lot of unnecessary complexity and need massive datasets to work well, which limits their usefulness in real-world applications.
What's the solution?
The researchers created a new framework called AnyDepth. They used a powerful image analyzer called DINOv3 to understand the images, then built a simpler decoder, the Simple Depth Transformer (SDT), to actually calculate the depth. This decoder is much smaller and faster than the one used in DPT, cutting the parameter count by roughly 85-89%. They also developed a way to automatically remove poor-quality images from the training data, improving the overall accuracy of the system.
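The paper itself doesn't include code here, but the core decoder idea can be illustrated in miniature: instead of DPT-style fusion blocks at every scale, a single path starts from the coarsest feature map and repeatedly upsamples and merges in the next finer map. This is a minimal NumPy sketch with made-up shapes, not the actual SDT implementation:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of an (H, W, C) feature map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def single_path_decode(features):
    # features: list of (H, W, C) maps ordered coarse -> fine.
    # One fusion path: upsample the running result and add the next
    # finer map, rather than running cross-scale fusion at each level.
    fused = features[0]
    for finer in features[1:]:
        fused = upsample2x(fused) + finer
    return fused

# Toy multi-scale features: 4x4, 8x8, 16x16, each with 8 channels.
feats = [np.ones((4 * 2**i, 4 * 2**i, 8)) for i in range(3)]
out = single_path_decode(feats)
print(out.shape)  # (16, 16, 8)
```

Because the fused map flows through one path, each scale is visited once, which is where the sketch's savings over per-level cross-scale fusion come from.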
Why it matters?
This work shows that you don't always need massive models and datasets to get good results in depth estimation. By focusing on a smarter design and better data quality, the researchers built a system that performs as well as, or better than, more complex ones, making it more practical for applications like robotics and self-driving cars.
Abstract
Monocular depth estimation aims to recover the depth information of 3D scenes from 2D images. Recent work has made significant progress, but its reliance on large-scale datasets and complex decoders limits efficiency and generalization. In this paper, we propose a lightweight and data-centric framework for zero-shot monocular depth estimation. First, we adopt DINOv3 as the visual encoder to obtain high-quality dense features. Second, to address the inherent drawbacks of DPT's complex structure, we design the Simple Depth Transformer (SDT), a compact transformer-based decoder. Compared to the DPT, it uses a single-path feature fusion and upsampling process to reduce the computational overhead of cross-scale feature fusion, achieving higher accuracy while reducing the number of parameters by approximately 85%-89%. Furthermore, we propose a quality-based filtering strategy to filter out harmful samples, thereby reducing dataset size while improving overall training quality. Extensive experiments on five benchmarks demonstrate that our framework surpasses the DPT in accuracy. This work highlights the importance of balancing model design and data quality for achieving efficient and generalizable zero-shot depth estimation. Code: https://github.com/AIGeeksGroup/AnyDepth. Website: https://aigeeksgroup.github.io/AnyDepth.
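The quality-based filtering strategy can be pictured as a simple thresholding pass over the training set. The abstract does not specify the scoring metric, so the `quality` field and `min_score` threshold below are placeholders for whatever per-sample measure the method actually uses:

```python
def filter_by_quality(samples, min_score):
    # Drop training samples whose quality score falls below a threshold;
    # the surviving set is smaller but cleaner, which is the paper's
    # stated goal of trading dataset size for training quality.
    return [s for s in samples if s["quality"] >= min_score]

dataset = [
    {"image": "a.png", "quality": 0.92},
    {"image": "b.png", "quality": 0.31},  # e.g. noisy or misaligned depth
    {"image": "c.png", "quality": 0.78},
]
kept = filter_by_quality(dataset, min_score=0.5)
print([s["image"] for s in kept])  # ['a.png', 'c.png']
```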