Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
Bowen Wen, Shaurya Dewan, Stan Birchfield
2025-12-15
Summary
This paper introduces a new approach to stereo vision (the technique computers use to perceive depth from two camera views, much as our eyes do) that aims for both accuracy and speed.
What's the problem?
Existing stereo vision systems face a trade-off: highly accurate systems are too slow for real-time use, such as in self-driving cars or robotics, while faster systems aren't very reliable and need a lot of specific training for each new situation. In short, getting both good performance *and* speed has been a major challenge.
What's the solution?
The researchers developed 'Fast-FoundationStereo,' a system that combines several techniques to overcome this challenge. They started with a powerful but slow existing model and 'compressed' it using a method called knowledge distillation, in which a small student network learns to mimic the large model's outputs. Next, they used neural architecture search to automatically discover the best way to filter matching information under a speed budget. Finally, they removed redundant parts of the refinement stage through structured pruning. To improve training, they also created a large dataset of 1.4 million real-world stereo image pairs using an automatic labeling process.
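As a rough illustration of the knowledge-distillation idea (not the paper's actual training code), a small student model can be trained to reproduce a frozen teacher's disparity predictions. Everything here is a toy stand-in: the "teacher" is just a fixed linear map, the features are random, and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake per-pixel features for 100 pixels, 8 features each (illustrative only).
features = rng.normal(size=(100, 8))

# Frozen "teacher": a fixed linear map standing in for the large, slow model.
# Its outputs act as soft targets (pseudo-labels) for the student.
teacher_weights = rng.normal(size=8)
teacher_disparity = features @ teacher_weights

# "Student": learnable weights, trained by gradient descent to match the
# teacher's outputs under a mean-squared distillation loss.
student_weights = np.zeros(8)
lr = 0.01
for _ in range(2000):
    pred = features @ student_weights
    grad = 2.0 * features.T @ (pred - teacher_disparity) / len(features)
    student_weights -= lr * grad

distill_error = np.mean(np.abs(features @ student_weights - teacher_disparity))
print(f"mean |student - teacher| disparity error: {distill_error:.6f}")
```

In the real system the student is a compact neural backbone and the teacher is the full FoundationStereo model, but the training signal works the same way: the student is supervised by the teacher's predictions rather than only by ground-truth labels.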
Why does it matter?
This work is important because it achieves a significant speed increase, over 10 times faster, without sacrificing accuracy. This makes high-quality stereo vision practical for applications that require real-time processing, opening up possibilities for advancements in areas like autonomous vehicles, augmented reality, and robotics.
Abstract
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/