Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
2025-01-22
Summary
This paper introduces a new AI system called Video Depth Anything that can figure out how far away things are in videos, no matter how long they run. It's like giving a computer the ability to see in 3D while watching a movie.
What's the problem?
Current AI systems are pretty good at guessing distances in single pictures, but they struggle when it comes to videos. The main issue is that their guesses jump around a lot from frame to frame, which makes the depth information look jittery and unrealistic. Some people have tried to fix this, but their solutions only work for short videos (under ten seconds) and either sacrifice quality or take too long to process.
What's the solution?
The researchers created Video Depth Anything by improving an existing system called Depth Anything V2. They added a special part to the AI that looks at how things move over time in the video. They also came up with a clever way to teach the AI to keep its distance guesses consistent between frames. To handle really long videos, they developed a strategy that focuses on key frames and then fills in the rest. They trained their AI using a mix of video clips with known depth information and regular images.
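The "consistency between frames" idea can be sketched in code. The paper constrains the temporal depth gradient, i.e. how depth changes from one frame to the next; a minimal illustrative version (not the paper's exact formulation, which may include masking and weighting details) penalizes the difference between the predicted and ground-truth frame-to-frame depth change:

```python
import numpy as np

def temporal_gradient_loss(pred, target):
    """Toy sketch of a temporal consistency loss.

    pred, target: arrays of shape (T, H, W) holding depth maps for
    T consecutive frames. We compare the *temporal gradient* (the
    depth change between consecutive frames) of the prediction
    against that of the ground truth, so flicker is penalized even
    when each individual frame looks plausible.
    """
    # Temporal gradient: depth change from frame t to frame t+1.
    pred_grad = pred[1:] - pred[:-1]        # shape (T-1, H, W)
    target_grad = target[1:] - target[:-1]  # shape (T-1, H, W)
    # L1 distance between the two gradients, averaged over all pixels.
    return np.abs(pred_grad - target_grad).mean()
```

Note that a prediction offset from the ground truth by a constant still gets zero loss, since only the *change* over time is constrained; spatial accuracy is handled by the usual per-frame depth losses.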
Why does it matter?
This matters because it could make a big difference in how computers understand and interact with videos. Imagine watching a YouTube video and being able to click on any object to see exactly how far away it is, or having virtual reality experiences that feel more realistic because the depth perception is spot-on. It could help with things like self-driving cars, augmented reality games, or even special effects in movies. The fact that it works on super-long videos and can run in real time (30 frames per second with the smallest model) means it could be used in many real-world applications where understanding depth in videos is important.
Abstract
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
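The "key-frame-based strategy" for arbitrarily long videos can be illustrated with a simplified sketch. The paper does not spell out the mechanics here, so the following is an assumption-laden toy version: run the model on overlapping clips and stitch each new clip to the previous output by fitting a scale and shift on the overlapping frames (the actual method also feeds key frames from earlier in the video into each window; `depth_model`, `window`, and `overlap` are hypothetical names):

```python
import numpy as np

def align_scale_shift(src, ref):
    """Least-squares scale s and shift t so that s * src + t ≈ ref."""
    s_flat, r_flat = src.ravel(), ref.ravel()
    A = np.stack([s_flat, np.ones_like(s_flat)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, r_flat, rcond=None)
    return scale, shift

def infer_long_video(frames, depth_model, window=32, overlap=8):
    """Toy windowed inference for long videos (not the paper's method).

    frames: array of shape (T, H, W, C); depth_model maps a clip of
    shape (t, H, W, C) to depth of shape (t, H, W). Each new window
    is aligned to the previous output on the shared overlap frames,
    so relative depth stays consistent across the whole video.
    """
    depths = list(depth_model(frames[:window]))
    start = window - overlap
    while start < len(frames):
        d = depth_model(frames[start:start + window])
        # Frames shared with the already-computed output.
        n = min(overlap, len(depths) - start)
        scale, shift = align_scale_shift(
            d[:n], np.stack(depths[start:start + n]))
        d = scale * d + shift
        depths[start:start + n] = list(d[:n])  # a real system might blend here
        depths.extend(d[n:])
        start += window - overlap
    return np.stack(depths[:len(frames)])
```

A production version would blend the overlap region and propagate key frames instead of simply overwriting, but the scale-and-shift alignment step captures the core stitching idea.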