Multi-View 3D Point Tracking
Frano Rajič, Haofei Xu, Marko Mihajlovic, Siyuan Li, Irem Demir, Emircan Gündoğdu, Lei Ke, Sergey Prokudin, Marc Pollefeys, Siyu Tang
2025-08-29
Summary
This paper introduces a learned method for tracking arbitrary points in 3D through a dynamic scene, using information from multiple cameras simultaneously.
What's the problem?
Existing methods for tracking 3D points have clear limitations. Monocular trackers, which rely on a single camera, struggle because depth is ambiguous from one viewpoint and points are easily lost when objects occlude them. Prior multi-camera methods can be accurate, but they typically require more than 20 cameras and slow per-sequence optimization. In short, it has been hard to get reliable online 3D tracking without either a huge camera rig or a loss of accuracy.
What's the solution?
The researchers built a feed-forward system that works with a practical number of cameras (around four) and directly predicts where the same 3D points are across views and over time. It fuses image features from all cameras into a single 3D point cloud, then combines a nearest-neighbor matching step with a transformer-based update so it can keep following points even when they are temporarily hidden from some views. They trained the system on about 5,000 synthetic multi-view videos and then evaluated it on real-world recordings.
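To make the fusion step concrete, here is a minimal sketch (not the authors' released code) of how multiple views can be merged into one point cloud: each camera's depth map is unprojected with its known intrinsics and pose into a shared world frame, and per-pixel image features are carried along. Function names and tensor shapes are illustrative assumptions.

```python
import torch

def unproject_to_world(depth, K, cam_to_world):
    """Lift one H x W depth map into world-space 3D points.

    depth:        (H, W) metric depth (sensor-based or estimated)
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    returns:      (H*W, 3) points in the shared world frame
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
    rays = pix @ torch.linalg.inv(K).T          # back-project pixels to camera rays
    pts_cam = rays * depth.reshape(-1, 1)       # scale rays by per-pixel depth
    pts_hom = torch.cat([pts_cam, torch.ones_like(pts_cam[:, :1])], dim=-1)
    return (pts_hom @ cam_to_world.T)[:, :3]    # transform into the world frame

def fuse_views(depths, feats, Ks, cam_to_worlds):
    """Merge all camera views into one point cloud with per-point features.

    depths:        list of (H, W) depth maps, one per view
    feats:         list of (H, W, C) image feature maps, one per view
    Ks:            list of (3, 3) intrinsics
    cam_to_worlds: list of (4, 4) camera-to-world poses
    returns:       (N, 3) fused points and (N, C) matching features
    """
    pts = [unproject_to_world(d, K, T) for d, K, T in zip(depths, Ks, cam_to_worlds)]
    fts = [f.reshape(-1, f.shape[-1]) for f in feats]
    return torch.cat(pts, dim=0), torch.cat(fts, dim=0)
```

Fusing everything in a shared world frame is what lets a single model reason jointly across a variable number of views; the paper reports that the tracker generalizes to setups with 1-8 cameras.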
Why it matters?
This work is important because it provides a more practical and accurate way to track points in 3D. It doesn't need a massive camera rig or per-video tuning, making it useful for robotics, virtual reality, and other applications where understanding how things move in 3D space is crucial. By releasing their tracker along with the training and evaluation datasets, the authors hope to encourage further research in this area and make 3D tracking more accessible.
Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
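For readers wondering what the "k-nearest-neighbors correlation" mentioned in the abstract might look like, below is a small illustrative sketch (an assumption-laden stand-in, not the released implementation): for each tracked 3D point, its k closest points in the fused cloud are gathered, and their relative offsets and feature similarities form a local descriptor that a transformer-style update could then consume to refine the track.

```python
import torch

def knn_correlation(track_xyz, track_feat, cloud_xyz, cloud_feat, k=16):
    """Build a local correlation descriptor around each tracked 3D point.

    track_xyz:  (M, 3) current 3D estimates of the tracked points
    track_feat: (M, C) appearance features of the tracked points
    cloud_xyz:  (N, 3) fused multi-view point cloud for the current frame
    cloud_feat: (N, C) per-point features of the fused cloud
    returns:    (M, k, 4) relative offsets plus feature correlations
    """
    dists = torch.cdist(track_xyz, cloud_xyz)           # (M, N) pairwise distances
    _, knn_idx = dists.topk(k, largest=False)           # k nearest neighbors per track
    nbr_xyz = cloud_xyz[knn_idx]                        # (M, k, 3) neighbor positions
    nbr_feat = cloud_feat[knn_idx]                      # (M, k, C) neighbor features
    offsets = nbr_xyz - track_xyz[:, None, :]           # local geometry around the track
    corr = (nbr_feat * track_feat[:, None, :]).sum(-1, keepdim=True)  # feature similarity
    return torch.cat([offsets, corr], dim=-1)
```

In a pipeline like the one described above, such descriptors would be computed frame by frame and fed to the transformer-based update that refines each trajectory online, which is what helps the tracker keep hold of points through occlusions.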