TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast
Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
2025-06-18
Summary
This paper introduces TR2M, a framework that converts the relative depth estimated from a single image into real-world metric depth by using the image and a text description of the scene together.
What's the problem?
Most monocular depth methods only estimate relative depth, that is, which objects are closer or farther, rather than exact distances in real units such as meters, which applications like robotics and 3D modeling require.
What's the solution?
The researchers built a framework that takes both the image and a text description as input and rescales relative depth into metric depth using cross-modality attention and scale-oriented contrastive learning. They also construct pseudo metric-depth supervision by aligning relative depth with ground-truth data and keeping only confident estimates, which improves training; a sketch of this step follows below.
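To make the supervision step concrete, here is a minimal sketch of how pseudo metric-depth labels could be built by fitting a scale and shift that align relative depth to ground truth, then keeping only pixels where the aligned depth stays close to the ground truth. The global least-squares fit, the function name `align_relative_to_metric`, and the 10% error threshold are illustrative assumptions, not the paper's exact procedure (which may, for instance, use pixel-wise rescale maps).

```python
# Hedged sketch: pseudo metric-depth supervision from relative depth.
# Assumption: a single global scale/shift fit over valid pixels, followed by a
# confidence filter; names and thresholds are illustrative, not from the paper.
import torch

def align_relative_to_metric(rel_depth, gt_depth, valid_mask, rel_error_thresh=0.1):
    """Fit metric = s * rel + t over valid pixels, then return pseudo labels and a
    confidence mask where the aligned depth matches ground truth closely."""
    r = rel_depth[valid_mask]
    g = gt_depth[valid_mask]
    # Closed-form least squares for scale s and shift t.
    A = torch.stack([r, torch.ones_like(r)], dim=1)        # (N, 2)
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution   # (2, 1)
    s, t = sol[0, 0], sol[1, 0]
    pseudo_metric = s * rel_depth + t
    # Confidence filter: keep pixels with small relative error against ground truth.
    rel_err = (pseudo_metric - gt_depth).abs() / gt_depth.clamp(min=1e-6)
    confident = valid_mask & (rel_err < rel_error_thresh)
    return pseudo_metric, confident

# Usage on dummy data: synthetic "metric" ground truth is a noisy affine map of rel.
rel = torch.rand(1, 480, 640)
gt = 2.0 * rel + 0.5 + 0.01 * torch.randn_like(rel)
mask = gt > 0
pseudo, conf = align_relative_to_metric(rel, gt, mask)
print(pseudo.shape, conf.float().mean().item())
```

In practice, such a confidence mask would be used so that the pseudo labels supervise the metric-depth prediction only where they are reliable.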
Why it matters?
Accurately estimating real distances from a single image benefits practical applications such as autonomous driving, augmented reality, and robot navigation, making these systems safer and more effective.
Abstract
TR2M is a framework that uses multimodal inputs to rescale relative depth to metric depth, improving performance across diverse datasets through cross-modality attention and contrastive learning.
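As an illustration of how cross-modality attention can drive the rescaling, the sketch below lets image features attend to text-caption embeddings and predicts per-pixel scale and shift maps that convert relative depth to metric depth. The module layout, feature dimensions, and the use of `nn.MultiheadAttention` are assumptions for illustration, not TR2M's actual architecture.

```python
# Hedged sketch of cross-modality attention for rescaling: image tokens attend to
# text tokens, and the fused features predict per-pixel scale and shift maps.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalRescaler(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Conv2d(dim, 2, kernel_size=1)  # predicts scale and shift maps

    def forward(self, img_feat, txt_feat, rel_depth):
        # img_feat: (B, C, H, W) image features; txt_feat: (B, T, C) text embeddings
        B, C, H, W = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)          # (B, HW, C) queries from the image
        fused, _ = self.attn(q, txt_feat, txt_feat)      # image tokens attend to text tokens
        fused = fused.transpose(1, 2).reshape(B, C, H, W)
        scale, shift = self.head(fused).chunk(2, dim=1)  # (B, 1, H, W) each
        # exp() keeps the scale positive; pixel-wise rescaling of relative depth.
        return scale.exp() * rel_depth + shift

# Usage on dummy tensors
model = CrossModalRescaler()
img_feat = torch.randn(2, 256, 30, 40)
txt_feat = torch.randn(2, 12, 256)       # e.g., caption embeddings from a text encoder
rel_depth = torch.rand(2, 1, 30, 40)
metric_depth = model(img_feat, txt_feat, rel_depth)
print(metric_depth.shape)                 # torch.Size([2, 1, 30, 40])
```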