TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast
Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
2025-06-18
Summary
This paper introduces TR2M, a framework that converts the relative depth estimated from a single image into real-world metric depth by using the image and a text description of the scene together.
What's the problem?
Most monocular depth methods only estimate relative depth, that is, which objects are closer or farther, rather than exact distances in real units such as meters, which applications like robotics and 3D modeling require.
What's the solution?
The researchers built a framework that takes both the image and a text description as input and rescales relative depth into metric depth using cross-modality attention and scale-oriented contrastive learning. They also construct pseudo metric-depth supervision by aligning relative depth with ground-truth data and keeping only confident estimates, which improves training; a sketch of this step follows below.
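To make the supervision step concrete, here is a minimal sketch of how pseudo metric-depth labels could be built by fitting a scale and shift that align relative depth to ground truth, then keeping only pixels where the aligned depth stays close to the ground truth. The global least-squares fit, the function name `align_relative_to_metric`, and the 10% error threshold are illustrative assumptions, not the paper's exact procedure (which may, for instance, use pixel-wise rescale maps).

```python
# Hedged sketch: pseudo metric-depth supervision from relative depth.
# Assumption: a single global scale/shift fit over valid pixels, followed by a
# confidence filter; names and thresholds are illustrative, not from the paper.
import torch

def align_relative_to_metric(rel_depth, gt_depth, valid_mask, rel_error_thresh=0.1):
    """Fit metric = s * rel + t over valid pixels, then return pseudo labels and a
    confidence mask where the aligned depth matches ground truth closely."""
    r = rel_depth[valid_mask]
    g = gt_depth[valid_mask]
    # Closed-form least squares for scale s and shift t.
    A = torch.stack([r, torch.ones_like(r)], dim=1)        # (N, 2)
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution   # (2, 1)
    s, t = sol[0, 0], sol[1, 0]
    pseudo_metric = s * rel_depth + t
    # Confidence filter: keep pixels with small relative error against ground truth.
    rel_err = (pseudo_metric - gt_depth).abs() / gt_depth.clamp(min=1e-6)
    confident = valid_mask & (rel_err < rel_error_thresh)
    return pseudo_metric, confident

# Usage on dummy data: synthetic "metric" ground truth is a noisy affine map of rel.
rel = torch.rand(1, 480, 640)
gt = 2.0 * rel + 0.5 + 0.01 * torch.randn_like(rel)
mask = gt > 0
pseudo, conf = align_relative_to_metric(rel, gt, mask)
print(pseudo.shape, conf.float().mean().item())
```

In practice, such a confidence mask would be used so that the pseudo labels supervise the metric-depth prediction only where they are reliable.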
Why it matters?
Accurately estimating real distances from a single image benefits practical applications such as autonomous driving, augmented reality, and robot navigation, making these systems safer and more effective.
Abstract
TR2M is a framework that uses multimodal inputs to rescale relative depth to metric depth, improving performance across diverse datasets through cross-modality attention and contrastive learning.
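As an illustration of how cross-modality attention can drive the rescaling, the sketch below lets image features attend to text-caption embeddings and predicts per-pixel scale and shift maps that convert relative depth to metric depth. The module layout, feature dimensions, and the use of `nn.MultiheadAttention` are assumptions for illustration, not TR2M's actual architecture.

```python
# Hedged sketch of cross-modality attention for rescaling: image tokens attend to
# text tokens, and the fused features predict per-pixel scale and shift maps.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalRescaler(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Conv2d(dim, 2, kernel_size=1)  # predicts scale and shift maps

    def forward(self, img_feat, txt_feat, rel_depth):
        # img_feat: (B, C, H, W) image features; txt_feat: (B, T, C) text embeddings
        B, C, H, W = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)          # (B, HW, C) queries from the image
        fused, _ = self.attn(q, txt_feat, txt_feat)      # image tokens attend to text tokens
        fused = fused.transpose(1, 2).reshape(B, C, H, W)
        scale, shift = self.head(fused).chunk(2, dim=1)  # (B, 1, H, W) each
        # exp() keeps the scale positive; pixel-wise rescaling of relative depth.
        return scale.exp() * rel_depth + shift

# Usage on dummy tensors
model = CrossModalRescaler()
img_feat = torch.randn(2, 256, 30, 40)
txt_feat = torch.randn(2, 12, 256)       # e.g., caption embeddings from a text encoder
rel_depth = torch.rand(2, 1, 30, 40)
metric_depth = model(img_feat, txt_feat, rel_depth)
print(metric_depth.shape)                 # torch.Size([2, 1, 30, 40])
```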