Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning

Mengtan Zhang, Zizhan Guo, Hongbo Zhao, Yi Feng, Zuyi Xiong, Yue Wang, Shaoyi Du, Hanli Wang, Rui Fan

2025-11-05

Summary

This paper focuses on improving how computers understand 3D space from videos, specifically figuring out how far away objects are (depth) and how the camera is moving (ego-motion).

What's the problem?

Current methods for learning depth and ego-motion often treat figuring out camera movement as just a helpful extra step, not a core part of the process. They either mix all types of movement together or ignore certain kinds of camera rotations. This limits how well the system can use the natural rules of geometry, making it less reliable and accurate, especially in tricky situations like poor lighting or fast movements.

What's the solution?

The researchers developed a new approach that treats the different components of camera movement separately instead of lumping them together. The system first aligns the 'viewpoints' (the optical axes and imaging planes) of consecutive video frames, then measures how the observed motion deviates from these aligned views to refine each part of the ego-motion estimate individually. The alignment also puts the frames into a configuration where depth and each translation component can be derived from one another through simple closed-form geometric formulas, adding checks and balances that make depth estimation more robust. They call their system DiMoDE.
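To get an intuition for the "closed-form geometric formulas" relating depth and translation, consider the simplest aligned case: once the imaging planes of two frames are parallel, a purely sideways camera translation produces the classic stereo relation Z = f · t / d, where d is the pixel disparity. The sketch below is an illustrative toy (function name and numbers are hypothetical, not from the paper):

```python
import numpy as np

def depth_from_coplanar_translation(fx, t_x, disparity):
    """Closed-form depth for a purely sideways translation t_x between
    two frames with aligned (coplanar) imaging planes:
        Z = fx * t_x / disparity
    i.e. the standard stereo triangulation relation, where `disparity`
    is the horizontal pixel shift of a point between the aligned frames."""
    return fx * t_x / disparity

# Toy numbers: focal length 700 px, 0.5 m sideways camera motion.
fx, t_x = 700.0, 0.5
disparity = np.array([35.0, 7.0])   # pixel shifts for two scene points
Z = depth_from_coplanar_translation(fx, t_x, disparity)
# The nearer point shows the larger disparity: Z = [10., 50.] metres.
```

The same relation can be read in the other direction (translation from depth and disparity), which is the sense in which depth and translation mutually constrain each other after alignment.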

Why it matters?

This work is important because it achieves better results than previous methods on standard tests and a new, challenging real-world dataset. By more accurately understanding depth and ego-motion, this research can help improve many applications like self-driving cars, robotics, and augmented reality, especially in difficult conditions where current systems struggle.

Abstract

Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.
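One way to see the "depth-independent rotational motions" the abstract refers to is through the infinite homography: a pure camera rotation R moves a pixel x to x' ~ K R K⁻¹ x regardless of scene depth, so the rotational part of the rigid flow can be computed and removed without any depth estimate. A minimal NumPy sketch of that standard relation (function name is illustrative, not from the paper's code):

```python
import numpy as np

def rotational_flow(K, R, pts):
    """Rigid flow induced by a pure rotation R of the camera.
    For a rotation-only motion, x' ~ K @ R @ inv(K) @ x (the infinite
    homography), independent of scene depth.
    K: (3, 3) intrinsics; R: (3, 3) rotation; pts: (N, 2) pixel coords.
    Returns the per-pixel flow x' - x, shape (N, 2)."""
    ones = np.ones((pts.shape[0], 1))
    x_h = np.hstack([pts, ones])              # homogeneous pixel coords
    H = K @ R @ np.linalg.inv(K)              # infinite homography
    x2_h = (H @ x_h.T).T
    x2 = x2_h[:, :2] / x2_h[:, 2:3]           # back to inhomogeneous
    return x2 - pts

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0,   0.0,   1.0]])
pts = np.array([[100.0, 50.0], [400.0, 300.0]])
# Identity rotation induces zero flow at every pixel.
assert np.allclose(rotational_flow(K, np.eye(3), pts), 0.0)
```

Subtracting such a rotational flow from the observed optical flow leaves a residual that depends only on translation and depth, which is the kind of per-component deviation the framework quantifies to constrain each ego-motion component individually.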