Video Depth without Video Models

Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, Konrad Schindler

2024-12-02

Summary

This paper introduces RollingDepth, a method for estimating depth in videos by adapting a depth model designed for single images into an accurate, temporally consistent video depth estimator.

What's the problem?

Estimating depth from videos is tricky. Running a single-image depth model on each frame separately causes flickering and breaks down when camera motion suddenly changes the depth range. Dedicated video depth models avoid some of these problems, but they are expensive to train and run and only produce short, fixed-length outputs, making them impractical for long videos.

What's the solution?

RollingDepth addresses these challenges by using a single-image depth model to analyze short, overlapping snippets of video (usually three frames at a time) and then assembling the snippet-level depth estimates into one consistent depth video. A robust optimization step aligns the scale and offset of overlapping snippets so that predictions from different frames agree, yielding smoother and more accurate depth maps across the entire video.
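To make the pipeline concrete, here is a minimal sketch in Python. It is not the paper's implementation: predict_snippet_depth is a placeholder for the LDM-based snippet model, and the sequential scale-and-shift alignment below is a simplification of RollingDepth's global, optimization-based registration; all function names are illustrative.

```python
import numpy as np

def predict_snippet_depth(frames):
    # Placeholder for RollingDepth's snippet depth model (in the paper, a
    # single-image latent diffusion model extended to frame triplets).
    # Returns one affine-invariant depth map per frame; random data here.
    rng = np.random.default_rng(hash(frames.tobytes()) % (2**32))
    return rng.random((frames.shape[0],) + frames.shape[1:3])

def align_scale_shift(pred, ref):
    # Closed-form least squares: find s, t minimizing ||s * pred + t - ref||^2.
    d, r = pred.ravel(), ref.ravel()
    dc, rc = d - d.mean(), r - r.mean()
    s = (dc * rc).sum() / ((dc * dc).sum() + 1e-8)
    return s, r.mean() - s * d.mean()

def rolling_depth(video, snippet_len=3):
    # video: (T, H, W, 3).  Slide a window over the clip, predict depth for
    # each snippet, and align every new snippet to the frames that earlier
    # snippets already covered before averaging the overlaps.
    T, H, W, _ = video.shape
    acc = np.zeros((T, H, W))   # running sum of aligned depth predictions
    cnt = np.zeros(T)           # number of snippets covering each frame
    for t0 in range(T - snippet_len + 1):
        idx = np.arange(t0, t0 + snippet_len)
        depth = predict_snippet_depth(video[idx])
        seen = cnt[idx] > 0     # frames shared with previous snippets
        if seen.any():
            ref = acc[idx[seen]] / cnt[idx[seen]][:, None, None]
            s, b = align_scale_shift(depth[seen], ref)
            depth = s * depth + b
        acc[idx] += depth
        cnt[idx] += 1
    return acc / cnt[:, None, None]

video = np.random.rand(10, 32, 32, 3).astype(np.float32)
print(rolling_depth(video).shape)  # -> (10, 32, 32)
```

The closed-form fit works because monocular depth predictions are only defined up to an affine transform, so matching each new snippet to the frames already assembled reduces to a linear least-squares problem.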

Why it matters?

This research matters because it makes accurate 3D understanding of ordinary videos more practical. By estimating video depth efficiently and consistently, RollingDepth can support applications in robotics, virtual reality, and video editing, where reliable depth at every frame is needed.

Abstract

Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets. (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at different frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: rollingdepth.github.io.
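The registration ingredient (ii) can be illustrated with a toy alternating least-squares scheme: repeatedly average the currently aligned snippets into a consensus video, then refit each snippet's scale and shift to that consensus in closed form. This is an assumed simplification for illustration only; the paper uses a robust optimization, and the names and structure below are hypothetical.

```python
import numpy as np

def register_snippets(snippets, num_frames, iters=50):
    # snippets: list of (frame_indices, depth) pairs; depth has shape
    # (len(frame_indices), H, W).  Indices need not be contiguous, so
    # snippets sampled at different frame rates (dilations) are allowed.
    H, W = snippets[0][1].shape[1:]
    s = np.ones(len(snippets))   # per-snippet scale
    t = np.zeros(len(snippets))  # per-snippet shift
    for _ in range(iters):
        # (a) average the aligned snippets into a consensus depth video
        acc = np.zeros((num_frames, H, W))
        cnt = np.zeros(num_frames)
        for k, (idx, d) in enumerate(snippets):
            acc[idx] += s[k] * d + t[k]
            cnt[idx] += 1
        consensus = acc / np.maximum(cnt, 1)[:, None, None]
        # (b) refit each snippet's scale/shift to the consensus (closed form)
        for k, (idx, d) in enumerate(snippets):
            x, y = d.ravel(), consensus[idx].ravel()
            xc, yc = x - x.mean(), y - y.mean()
            s[k] = (xc * yc).sum() / ((xc * xc).sum() + 1e-8)
            t[k] = y.mean() - s[k] * x.mean()
        s[0], t[0] = 1.0, 0.0    # pin one snippet: depth is only affine-defined
    return consensus

# Toy usage: triplets at dilation 1 and dilation 2 over a 7-frame clip.
rng = np.random.default_rng(0)
snips = [(np.arange(t0, t0 + 3), rng.random((3, 8, 8))) for t0 in range(5)]
snips += [(np.arange(t0, t0 + 6, 2), rng.random((3, 8, 8))) for t0 in range(2)]
print(register_snippets(snips, num_frames=7).shape)  # -> (7, 8, 8)
```

Mixing dilation rates, as in the toy usage above, is what lets short snippets exchange information across distant frames: a dilated triplet ties together frames that no contiguous triplet covers at once.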