TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin

2025-04-11

Summary

This paper introduces TAPNext, a new way for computers to track the movement of any point in a video, such as following a dot or an object as it moves from frame to frame. TAPNext treats tracking as a prediction problem: the computer guesses the point's next position, much like a language model predicts the next word in a sentence.

What's the problem?

The problem is that most current tracking methods are complicated, relying on many hand-designed tricks and rules that make them hard to adapt to different tasks or to improve as technology advances. These methods can also be slow and inflexible, which limits their usefulness in real-world applications like robotics or video editing.

What's the solution?

TAPNext solves this by simplifying the tracking process. Instead of relying on those hand-designed tricks, it trains a model to predict the next location of a point in a video, step by step, similar to how language models predict the next word. This makes the system faster, easier to scale, and able to work in real time without needing to look at a window of frames at once. The researchers also found that, through training alone, TAPNext naturally learns some of the smart tracking behaviors that other systems have to be explicitly programmed to do.
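To make the "tracking as next-token prediction" idea concrete, here is a minimal toy sketch. It assumes positions are quantized onto a grid so each one becomes a token from a fixed vocabulary, and a causal loop decodes one position token per frame. The predictor here is a stand-in (a simple constant-velocity guess), not the actual TAPNext network, which conditions on the video frames themselves.

```python
# Toy sketch of point tracking as next-token prediction (assumption:
# positions quantized to a GRID x GRID vocabulary of position tokens).
GRID = 32  # coordinates quantized to a 32x32 grid

def pos_to_token(x, y):
    """Map a grid position to a single token id, like a vocabulary entry."""
    return y * GRID + x

def token_to_pos(tok):
    """Invert pos_to_token."""
    return tok % GRID, tok // GRID

def predict_next_token(history):
    """Placeholder causal predictor: extrapolate the last step.
    (A real model would also condition on the video frames.)"""
    if len(history) < 2:
        return history[-1]
    x0, y0 = token_to_pos(history[-2])
    x1, y1 = token_to_pos(history[-1])
    nx = min(max(2 * x1 - x0, 0), GRID - 1)  # clamp to the grid
    ny = min(max(2 * y1 - y0, 0), GRID - 1)
    return pos_to_token(nx, ny)

def track(query_xy, num_frames):
    """Decode a track online: one position token per frame, causally,
    with no temporal window over future frames."""
    tokens = [pos_to_token(*query_xy)]
    for _ in range(num_frames - 1):
        tokens.append(predict_next_token(tokens))
    return [token_to_pos(t) for t in tokens]

print(track((5, 5), 4))  # a stationary query point stays put: [(5, 5), (5, 5), (5, 5), (5, 5)]
```

The key property this sketch illustrates is causality: each frame's position is decoded only from the tokens seen so far, which is what lets such a tracker run online with minimal latency.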

Why it matters?

This work matters because it makes tracking objects in videos much simpler, faster, and more adaptable. With TAPNext, it's easier to build systems for things like robots, video editing tools, or 3D reconstruction that need to follow moving points accurately, opening up new possibilities for technology and creative projects.

Abstract

Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.