Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
2025-12-19
Summary
This research introduces a new computer vision model for estimating depth in panoramic images: from a single 360-degree view, it can accurately judge how far away things are, even when objects in the scene sit at very different distances.
What's the problem?
Existing methods for estimating depth from images often struggle with wide panoramic views, especially when a scene contains both very near and very far objects. Collecting enough real-world panoramic training data is difficult, and because real photos and computer-generated images look different, models trained largely on synthetic data tend to generalize poorly to new real-world scenes.
What's the solution?
The researchers tackled this by building a large-scale dataset that combines existing public datasets, realistic synthetic images rendered with a game engine (Unreal Engine 5), and images generated from text descriptions. They also developed a three-stage pipeline that automatically curates pseudo-labels, producing reliable depth annotations for otherwise unlabeled images. The model itself builds on a powerful pre-trained vision backbone (DINOv3-Large) and adds a plug-and-play range mask head together with sharpness- and geometry-centric training objectives, so that depth stays accurate across all distances and consistent across viewpoints.
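To make the "range mask head" idea concrete, here is a minimal PyTorch sketch of one plausible design. The summary and abstract only name the component, so the distance bins, the soft-mask blending, and all parameter values below are illustrative assumptions, and a random tensor stands in for DINOv3-Large features:

```python
import torch
import torch.nn as nn

class RangeMaskDepthHead(nn.Module):
    """Hypothetical sketch of a plug-and-play range mask head; the paper only
    names the component, so this is one plausible reading, not its design.
    Per pixel, it predicts (a) a soft mask over K coarse distance ranges and
    (b) a depth estimate within each range, then blends the two."""

    def __init__(self, feat_dim: int, num_ranges: int = 4,
                 range_bounds=(1.0, 4.0, 16.0, 64.0)):
        super().__init__()
        # Soft assignment of each pixel to one of K distance ranges.
        self.mask_head = nn.Conv2d(feat_dim, num_ranges, kernel_size=1)
        # One normalized (0..1) depth prediction per range.
        self.depth_head = nn.Conv2d(feat_dim, num_ranges, kernel_size=1)
        # Upper bound of each range in meters (assumed values, for illustration).
        self.register_buffer("bounds", torch.tensor(range_bounds))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        masks = self.mask_head(feats).softmax(dim=1)         # (B, K, H, W)
        rel = self.depth_head(feats).sigmoid()               # (B, K, H, W) in (0,1)
        per_range = rel * self.bounds.view(1, -1, 1, 1)      # scale to meters
        return (masks * per_range).sum(dim=1, keepdim=True)  # blended metric depth


if __name__ == "__main__":
    # Stand-in for backbone features (a real DINOv3-Large would supply these).
    feats = torch.randn(2, 1024, 32, 64)  # (batch, channels, H/16, W/16)
    head = RangeMaskDepthHead(feat_dim=1024)
    print(head(feats).shape)  # torch.Size([2, 1, 32, 64])
```

The soft blending is one way such a head could stay "plug-and-play": it attaches to any feature map via 1x1 convolutions and degrades gracefully to a single-range predictor when one mask dominates.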
Why it matters?
This work is important because it allows for more accurate 3D understanding of the world from panoramic images. This has applications in areas like virtual reality, augmented reality, robotics, and self-driving cars, where knowing the distance to objects is crucial for creating realistic experiences and safe navigation. The model's ability to work well in new, unseen environments is a significant step forward.
Abstract
In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the perspective of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/
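For readers wondering what a three-stage pseudo-label curation pipeline could look like in practice, below is a hedged NumPy sketch. The abstract does not describe the stages, so the predict / cross-check / filter structure, the teacher ensemble, and the consistency_thresh parameter are all assumptions for illustration, not the authors' method:

```python
import numpy as np

def curate_pseudo_labels(image, teachers, consistency_thresh=0.1):
    """Hypothetical three-stage curation sketch; the paper names a three-stage
    pipeline but not its stages, so the stages below are assumptions."""
    # Stage 1: predict depth with an ensemble of teacher models.
    preds = np.stack([t(image) for t in teachers])           # (T, H, W)

    # Stage 2: cross-check predictions; keep pixels where teachers agree.
    median = np.median(preds, axis=0)
    rel_dev = np.abs(preds - median) / np.maximum(median, 1e-6)
    reliable = rel_dev.max(axis=0) < consistency_thresh      # (H, W) bool

    # Stage 3: keep only reliable pixels as pseudo ground truth.
    label = np.where(reliable, median, np.nan)               # NaN = ignore in loss
    return label, reliable


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy "teachers" that roughly agree, standing in for real depth models.
    base = rng.uniform(1.0, 10.0, size=(64, 128))
    teachers = [lambda img, b=base: b,
                lambda img, b=base: b * rng.uniform(0.95, 1.05, b.shape)]
    label, mask = curate_pseudo_labels(image=None, teachers=teachers)
    print(f"kept {mask.mean():.1%} of pixels as pseudo ground truth")
```

Filtering to consistent pixels and masking the rest with NaN is a common way to let a student model train only on trustworthy pseudo-labels while ignoring the rest in the loss.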