The system builds on Wan2.1-T2V-1.3B, a pretrained video diffusion transformer (DiT), and adapts it through training stages involving LoRA on the DiT, the input and output projections, and VAE components. It is trained on synthetic datasets such as Kubric, Dynamic Replica, PointOdyssey, and TartanAir, using rendered sequences with depth or camera supervision to learn dense 3D motion. The adapted model outputs point trajectories and visibility estimates over time.
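
To make the LoRA part of that adaptation concrete, here is a minimal sketch of the general technique: a frozen weight matrix plus a trainable low-rank update. It is not taken from the TrackCraft3R code. The class name `LoRALinear`, the rank, the `alpha` scaling, and the toy weights are all illustrative assumptions, and a real implementation would use a tensor library rather than plain Python lists.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical layer: frozen weight W plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)                  # frozen path
        update = matvec(self.B, matvec(self.A, x))     # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]


# Toy 2x3 weight standing in for one pretrained projection (assumed values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, rank=2, alpha=4.0)
x = [1.0, 2.0, 3.0]

y_init = layer.forward(x)   # equals W @ x because B is all zeros

layer.B[0][0] = 1.0         # stand-in for a training step that changes B
y_tuned = layer.forward(x)  # first output now shifts by scale * (A[0] . x)
```

Because `B` starts at zero, the adapted layer reproduces the pretrained layer exactly before any training, which is one reason low-rank adapters are a common way to repurpose a large pretrained model without overwriting its weights.
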
TrackCraft3R is useful for 3D scene understanding, robotics perception, dynamic reconstruction, augmented reality, and research on reusing generative video priors for geometric tasks. Its main contribution is showing that a model originally designed for video generation can be converted into a dense tracker, which suggests that diffusion transformers encode useful motion and spatial structure. The submitted URL points to a GitHub repository containing the official code, so the tool is listed as free and open source.


