Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator

Xiankang He, Dongyan Guo, Hongji Li, Ruibo Li, Ying Cui, Chi Zhang

2025-02-27

Summary

This paper presents a new way to improve how computers estimate depth in images captured by a single camera. The researchers teach this skill more effectively by combining depth information from different parts of the image and by learning from multiple 'teacher' models instead of just one.

What's the problem?

Estimating depth from a single image is tricky for computers because a flat 2D picture doesn't directly contain 3D information. Current methods sometimes struggle with accuracy, especially across different types of scenes. Worse, the existing teaching techniques rely on 'pseudo-labels' (depth guesses produced by other models), and the way these labels are processed can amplify their errors during training rather than correct them.

What's the solution?

The researchers came up with two main solutions. First, they developed a method called 'Cross-Context Distillation' that looks at both the whole image and smaller parts of it to get better depth estimates. Second, they used multiple 'teacher' models instead of just one, each with its own strengths, to train the computer. This helps the computer learn more accurately and handle different types of scenes better.
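To make the two ideas concrete, here is a minimal PyTorch sketch of what a cross-context distillation loss and multi-teacher sampling could look like. This is an illustration under assumptions, not the paper's actual implementation: the function names (`cross_context_loss`, `sample_teacher`), the median-based normalization, the L1 loss, and the equal weighting of the two terms are all choices made for this example.

```python
import random
import torch
import torch.nn.functional as F

def normalize_depth(d, eps=1e-6):
    # Affine-invariant normalization (median shift, mean-absolute scale),
    # a common choice in relative-depth training; details are assumptions.
    t = d.median()
    s = (d - t).abs().mean()
    return (d - t) / (s + eps)

def cross_context_loss(student, teacher, image, crop_box):
    """Hypothetical sketch: distill on the full image (global context) and on
    a crop that the teacher re-predicts at close range (local context)."""
    # Global context: teacher pseudo-labels the whole image.
    with torch.no_grad():
        pseudo_global = teacher(image)
    pred_global = student(image)
    loss_global = F.l1_loss(normalize_depth(pred_global),
                            normalize_depth(pseudo_global))

    # Local context: teacher sees only the crop, so its prediction there
    # carries finer detail; the student's crop region must match it.
    y0, y1, x0, x1 = crop_box
    with torch.no_grad():
        pseudo_local = teacher(image[..., y0:y1, x0:x1])
    pred_local = pred_global[..., y0:y1, x0:x1]
    loss_local = F.l1_loss(normalize_depth(pred_local),
                           normalize_depth(pseudo_local))

    return 0.5 * (loss_global + loss_local)

def sample_teacher(teachers):
    # Multi-teacher distillation (sketch): pick one teacher per batch so the
    # student sees complementary supervision over the course of training.
    return random.choice(teachers)
```

In this sketch, the key point is that the same image region is supervised twice: once through the teacher's whole-image prediction and once through its crop-level prediction, so the student inherits both global scene layout and local detail.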

Why it matters?

This research matters because accurate depth estimation from single images is crucial for many technologies we use today and will use more in the future. It's important for things like self-driving cars, augmented reality, and robotics. By making depth estimation more accurate and reliable across different types of scenes, this research could lead to safer and more effective technologies in these areas. It could also help improve computer vision systems in general, which has wide-ranging applications in fields from medicine to entertainment.

Abstract

Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current depth normalization methods for distillation, relying on global normalization, can amplify noisy pseudo-labels, reducing distillation effectiveness. In this paper, we systematically analyze the impact of different depth normalization strategies on pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to enhance pseudo-label quality. Additionally, we introduce a multi-teacher distillation framework that leverages complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
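The abstract's central observation is that *global* normalization of pseudo-labels lets noise in one region distort the statistics of the whole depth map. A minimal sketch of the contrast between global and local normalization, assuming a simple median/mean-absolute-deviation scheme and an illustrative grid size (neither is taken from the paper):

```python
import torch

def global_norm(d, eps=1e-6):
    # Global normalization: one set of statistics for the whole map, so a
    # noisy region in the pseudo-label shifts every other pixel's value.
    t = d.median()
    s = (d - t).abs().mean()
    return (d - t) / (s + eps)

def local_norm(d, grid=4, eps=1e-6):
    # Local normalization: each grid cell is normalized independently, so
    # noise stays confined to its own cell. Grid size is an assumption.
    h, w = d.shape[-2:]
    out = torch.empty_like(d)
    for i in range(grid):
        for j in range(grid):
            ys = slice(i * h // grid, (i + 1) * h // grid)
            xs = slice(j * w // grid, (j + 1) * w // grid)
            cell = d[..., ys, xs]
            t = cell.median()
            s = (cell - t).abs().mean()
            out[..., ys, xs] = (cell - t) / (s + eps)
    return out
```

Combining cues normalized at both granularities, as the proposed Cross-Context Distillation does, aims to keep the global scene structure while limiting how far localized pseudo-label noise can propagate.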