SpatialTrackerV2 achieves significant improvements by jointly learning geometry and motion, outperforming all prior 3D tracking methods by a clear margin; it also delivers strong results in 2D tracking and dynamic 3D reconstruction. The model consists of two main components: a VGGT-style front-end network that extracts high-level semantic features from the input video to initialize consistent scene geometry and camera motion, and a track refiner that iteratively updates all 4D attributes, including 2D and 3D point tracks, trajectory-wise dynamic probabilities, and camera poses. A minimal sketch of this two-component design follows.
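The snippet below is a hedged, illustrative sketch of that front-end + refiner structure, not the authors' released implementation: the module names (`FrontEnd`, `TrackRefiner`), tensor shapes, head designs, and iteration count are all assumptions chosen only to make the data flow concrete.

```python
# Hypothetical sketch of the two-component design described above.
# Everything here (names, shapes, heads, iteration count) is an assumption
# for illustration; it is NOT the SpatialTrackerV2 codebase.
import torch
import torch.nn as nn


class FrontEnd(nn.Module):
    """Stand-in for the VGGT-style network: maps video frames to features
    plus initial depth (scene geometry) and camera-pose estimates."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        self.depth_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.pose_head = nn.Linear(dim, 7)  # 3 translation + 4 quaternion

    def forward(self, video: torch.Tensor):
        # video: (T, 3, H, W) -> feats: (T, dim, H/4, W/4)
        feats = self.encoder(video)
        depth = self.depth_head(feats).squeeze(1)       # initial geometry
        poses = self.pose_head(feats.mean(dim=(2, 3)))  # initial camera motion
        return feats, depth, poses


class TrackRefiner(nn.Module):
    """Stand-in refiner: iteratively updates the 4D attributes
    (2D/3D tracks, per-trajectory dynamic probabilities, camera poses)."""

    def __init__(self, dim: int = 64, iters: int = 4):
        super().__init__()
        self.iters = iters
        # Predicts residuals for (u, v, x, y, z) per point plus a dynamic logit.
        self.update = nn.Linear(dim + 5, 5 + 1)
        self.pose_update = nn.Linear(dim, 7)  # residual camera-pose update

    def forward(self, feats, tracks2d, tracks3d, poses):
        # feats: (T, dim, h, w); tracks2d: (T, N, 2); tracks3d: (T, N, 3)
        ctx = feats.mean(dim=(2, 3))                    # (T, dim) global context
        dyn_logit = torch.zeros(tracks2d.shape[1])      # (N,) one per trajectory
        for _ in range(self.iters):
            state = torch.cat([tracks2d, tracks3d], dim=-1)   # (T, N, 5)
            inp = torch.cat(
                [ctx[:, None].expand(-1, state.shape[1], -1), state], dim=-1
            )
            out = self.update(inp)                      # (T, N, 6) residuals
            tracks2d = tracks2d + out[..., :2]          # refine 2D tracks
            tracks3d = tracks3d + out[..., 2:5]         # refine 3D tracks
            dyn_logit = dyn_logit + out[..., 5].mean(dim=0)
            poses = poses + self.pose_update(ctx)       # refine camera poses
        return tracks2d, tracks3d, torch.sigmoid(dyn_logit), poses
```

The key design point the sketch mirrors is the split of labor: the front end produces one consistent initialization of geometry and camera motion, and the refiner only has to predict residual corrections across a fixed number of iterations.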
SpatialTrackerV2 produces strong qualitative results across diverse scenarios, with all outputs generated in a purely feed-forward manner, taking only 10-20 seconds per sequence. Its ability to estimate camera motion, consistent geometry, and pixel-wise 3D trajectories in a single pass makes it a powerful tool for a range of applications; the snippet below illustrates this single-pass usage with the sketch above. With its scalable training and strong performance, SpatialTrackerV2 has the potential to advance 3D point tracking and related areas.
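A hypothetical end-to-end call, reusing the `FrontEnd` and `TrackRefiner` classes from the sketch above; the shapes and the random query-point initialization are arbitrary stand-ins:

```python
# Assumed single-pass usage of the illustrative modules defined earlier.
T, N, H, W = 8, 16, 128, 160
video = torch.rand(T, 3, H, W)                   # placeholder input clip

front_end, refiner = FrontEnd(), TrackRefiner()
feats, depth, poses = front_end(video)           # initial geometry + motion
tracks2d = torch.rand(T, N, 2) * torch.tensor([W, H])  # random query points
tracks3d = torch.rand(T, N, 3)

# One feed-forward refinement pass yields all 4D attributes at once.
tracks2d, tracks3d, dyn_prob, poses = refiner(feats, tracks2d, tracks3d, poses)
print(tracks3d.shape, dyn_prob.shape)  # torch.Size([8, 16, 3]) torch.Size([16])
```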