TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels

Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, Yuan Liu

2025-12-10

Summary

This paper introduces a new method, called TrackingWorld, for tracking the movement of objects in 3D space using only a single camera video. It aims to create a complete 3D map of everything moving in the scene over time.

What's the problem?

Current methods for 3D tracking from a single camera struggle with two main issues. First, they have trouble distinguishing between the camera's own movement and the movement of objects in the scene. Second, they often fail to track objects that appear *later* in the video, like someone walking into the frame. They can't easily pick up and track these new, dynamic elements.

What's the solution?

TrackingWorld solves these problems in a few steps. First, a tracking upsampler 'fills in the gaps' in sparse tracking data, turning it into a dense 2D tracking map. Then, this upsampler is applied to every frame of the video, so that objects appearing later get tracked too, while redundant tracks in overlapping regions are removed. Finally, an optimization jointly estimates the camera poses and the 3D positions of the tracks, converting the dense 2D tracks into accurate, world-centric 3D trajectories.
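To make the upsampling step concrete, here is a toy sketch of what "filling in the gaps" could mean: every pixel inherits the motion of its nearest sparse track point. This is a nearest-neighbor stand-in, not the paper's learned upsampler; the function name and interface are hypothetical.

```python
import numpy as np

def upsample_tracks(sparse_xy, sparse_motion, grid_hw):
    """Toy 'tracking upsampler' (nearest-neighbor stand-in for the
    paper's learned module): densify sparse 2D track motions so that
    every pixel in an H x W grid gets a motion vector.

    sparse_xy:     (N, 2) array of sparse track pixel positions (x, y)
    sparse_motion: (N, 2) array of per-track 2D motion vectors
    grid_hw:       (H, W) output grid size
    """
    H, W = grid_hw
    ys, xs = np.mgrid[0:H, 0:W]
    # All pixel coordinates as (x, y) pairs, one row per pixel.
    pix = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    # Distance from every pixel to every sparse track point.
    dists = np.linalg.norm(pix[:, None, :] - sparse_xy[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Each pixel copies the motion of its nearest sparse track.
    return sparse_motion[nearest].reshape(H, W, 2)
```

On a 4x4 grid with two sparse tracks at opposite corners, each pixel picks up the motion of whichever corner is closer; the real upsampler would instead learn this lifting from data.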

Why it matters?

This research is important because it allows for more accurate and complete 3D tracking using just a standard video camera. This has potential applications in areas like robotics, virtual reality, and creating 3D models of real-world environments, all without needing expensive or complex 3D sensors.

Abstract

Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
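The final back-projection step can be illustrated with the standard pinhole camera model: a pixel (u, v) with depth d lifts to the camera frame as d·K⁻¹[u, v, 1]ᵀ and then to world coordinates via the camera pose. In TrackingWorld the depths and poses are jointly optimized; in this sketch they are assumed given, and the function name is illustrative.

```python
import numpy as np

def backproject_to_world(uv, depth, K, R_wc, t_wc):
    """Lift one 2D track point into world coordinates with the pinhole
    model: X_world = R_wc @ (depth * K^{-1} [u, v, 1]^T) + t_wc.

    uv:    (u, v) pixel coordinates of the track point
    depth: scalar depth along the camera ray
    K:     (3, 3) camera intrinsics matrix
    R_wc:  (3, 3) camera-to-world rotation
    t_wc:  (3,)   camera-to-world translation
    """
    uv1 = np.array([uv[0], uv[1], 1.0])
    # Ray direction in the camera frame, scaled by depth.
    x_cam = depth * (np.linalg.inv(K) @ uv1)
    # Transform from the camera frame into the world frame.
    return R_wc @ x_cam + t_wc
```

Applying this per frame with the estimated per-frame pose is what places all tracks in a single world-centric coordinate system, rather than in each frame's camera coordinates.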