Key Features

Unified, end-to-end 3D point tracking model
Estimates camera motion, consistent geometry, and pixel-wise 3D trajectories
Fully differentiable architecture
Scalable training across diverse data sources
Jointly learns geometry and motion
Outperforms prior 3D tracking methods
Delivers strong results in 2D tracking and dynamic 3D reconstruction
Fast inference time (10-20 seconds per sequence)

By jointly learning geometry and motion, SpatialTrackerV2 outperforms all prior 3D tracking methods by a clear margin, and it also delivers strong results in 2D tracking and dynamic 3D reconstruction. The model has two main components. A VGGT-style network extracts high-level semantic features from the input video and uses them to initialize consistent scene geometry and camera motion. A track refiner then iteratively updates all 4D attributes: 2D and 3D point tracks, trajectory-wise dynamic probabilities, and camera poses.
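To make the relationship between these outputs concrete, the sketch below shows how 2D tracks, per-point depth, and camera poses combine into world-space 3D trajectories. It is a minimal illustration, not SpatialTrackerV2's code or API: the function names, the pinhole intrinsics matrix K, the camera-to-world pose convention, and the displacement heuristic are all assumptions made for this example. In particular, the model learns its dynamic probabilities, while the displacement score here is only a crude stand-in for the idea that static points stay fixed in world space once camera motion is removed.

```python
import numpy as np


def lift_tracks_to_world(tracks_2d, depths, K, cam_to_world):
    """Lift 2D tracks to world-space 3D trajectories (hypothetical helper).

    tracks_2d:    (T, N, 2) pixel coordinates (u, v) of N points over T frames
    depths:       (T, N) depth of each tracked point along the camera z axis
    K:            (3, 3) pinhole intrinsics (assumed shared by all frames)
    cam_to_world: (T, 4, 4) camera-to-world pose for each frame
    returns:      (T, N, 3) world-space positions
    """
    T, N, _ = tracks_2d.shape
    # Homogeneous pixel coordinates (u, v, 1).
    pixels_h = np.concatenate([tracks_2d, np.ones((T, N, 1))], axis=-1)
    # Back-project to camera rays with z = 1, then scale by depth.
    rays = pixels_h @ np.linalg.inv(K).T
    points_cam = rays * depths[..., None]
    # Apply each frame's rotation and translation.
    R = cam_to_world[:, :3, :3]
    t = cam_to_world[:, :3, 3]
    return np.einsum("tij,tnj->tni", R, points_cam) + t[:, None, :]


def max_world_displacement(points_world):
    """Largest distance of each trajectory from its first-frame position.

    A simple heuristic (not the model's learned probability): values near
    zero suggest a static point, larger values suggest a moving one.
    """
    offsets = points_world - points_world[:1]
    return np.linalg.norm(offsets, axis=-1).max(axis=0)
```

With a static scene point and a moving camera, the pixel position changes from frame to frame while the lifted world position stays the same, which is why a good camera estimate matters for separating object motion from camera motion.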


Qualitative results cover diverse scenarios, and all of them are generated by the model in a purely feed-forward manner, taking only 10-20 seconds per sequence. Because it estimates camera motion, consistent geometry, and pixel-wise 3D trajectories at once, the model is useful wherever these quantities are needed together. With its scalable training and strong performance, SpatialTrackerV2 has the potential to advance 3D point tracking and related areas.
