From Editor to Dense Geometry Estimator

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao

2025-09-05

Summary

This paper explores a better way to teach computers to understand the 3D shape of objects in images, specifically estimating depth and surface normals. It argues that using image *editing* models, rather than image *generating* models, is a more effective approach for this task.

What's the problem?

Currently, many methods for figuring out the 3D structure of a scene rely on models originally designed to *create* images from text. However, understanding the 3D shape of an existing image is more like *changing* an image, which suggests that models built for image editing might be a better starting point. The challenge is that these editing models aren't naturally set up to give precise 3D information: they were trained for a different kind of task, and their native number format (BFloat16) offers less precision than accurate depth and normal estimation requires.

What's the solution?

The researchers developed a new framework called FE2E that adapts a powerful image editing model (built on the Diffusion Transformer, or DiT, architecture) for the specific task of predicting depth and surface normals. They did this by changing how the model learns, reformulating its flow matching loss into a "consistent velocity" objective suited to a deterministic task, and by using logarithmic quantization to bridge the gap between the model's BFloat16 number format and the precision that accurate 3D measurements demand. They also exploited the model's global attention, which looks at the whole image at once, to estimate depth and normals jointly in a single forward pass, so each prediction helps improve the other.
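The precision idea can be illustrated with a toy sketch (not the paper's actual code): quantizing depth in log space makes the reconstruction error a roughly fixed *fraction* of the depth at every scale, instead of an absolute error that swamps nearby objects. The depth range and bit width below are made-up illustration values, not values from the paper.

```python
import numpy as np

# Illustrative constants, not from the paper.
D_MIN, D_MAX, N_BITS = 0.1, 100.0, 16

def log_quantize(depth, d_min=D_MIN, d_max=D_MAX, n_bits=N_BITS):
    """Map depth to uniform integer levels in log space, so the
    quantization step is a constant relative fraction of depth."""
    levels = 2 ** n_bits - 1
    t = (np.log(depth) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return np.round(np.clip(t, 0.0, 1.0) * levels).astype(np.int32)

def log_dequantize(code, d_min=D_MIN, d_max=D_MAX, n_bits=N_BITS):
    """Invert log_quantize: integer level back to metric depth."""
    levels = 2 ** n_bits - 1
    t = code / levels
    return np.exp(np.log(d_min) + t * (np.log(d_max) - np.log(d_min)))

depths = np.array([0.5, 5.0, 50.0])
recovered = log_dequantize(log_quantize(depths))
relative_error = np.abs(recovered - depths) / depths  # similar at all scales
```

With a linear (uniform) quantizer, the same bit budget would give far-away points tolerable error but nearby points a proportionally huge one; the log encoding spends precision evenly in relative terms.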

Why it matters?

This work is significant because it achieves much better results in estimating depth and surface normals than previous methods, including the DepthAnything series, which was trained on roughly 100× more data; on the ETH3D dataset the gains exceed 35%. It demonstrates that image editing models have a natural advantage for this type of 3D understanding, and FE2E provides a practical way to leverage that advantage. This could lead to improvements in areas like robotics, self-driving cars, and augmented reality, where understanding the 3D world is crucial.

Abstract

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneeringly adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data. The project page can be accessed here: https://amap-ml.github.io/FE2E/.
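One plausible reading of the "consistent velocity" objective mentioned in the abstract, sketched as a minimal, hypothetical PyTorch loss: because the image latent and the geometry latent form a deterministic pair, the straight path between them has a single constant velocity target at every timestep, unlike the randomly paired noise of generative flow matching. The function names and model signature here are assumptions for illustration, not the paper's API, and the paper's actual objective may differ in detail.

```python
import torch
import torch.nn.functional as F

def consistent_velocity_loss(model, image_latent, target_latent):
    """Flow-matching-style loss for a deterministic image-to-image task.

    Points are sampled on the straight path from the image latent to the
    geometry (depth/normal) latent; the velocity target along that path is
    constant (target - image), independent of the sampled timestep t.
    """
    b = image_latent.shape[0]
    t = torch.rand(b, 1, 1, 1, device=image_latent.device)
    x_t = (1 - t) * image_latent + t * target_latent  # point on the path
    v_target = target_latent - image_latent           # constant velocity
    v_pred = model(x_t, t)                            # model predicts velocity
    return F.mse_loss(v_pred, v_target)
```

In this reading, "consistent" reflects that the supervision signal does not change with t, which is what makes convergence for a deterministic task more stable than regressing toward randomly drawn noise endpoints.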