Self-Supervised Any-Point Tracking by Contrastive Random Walks
Ayush Shrivastava, Andrew Owens
2024-09-26

Summary
This paper introduces a new method for tracking any point in videos, called Self-Supervised Any-Point Tracking by Contrastive Random Walks. It trains a model without manual labels to follow specific points in video footage over time, even across long clips.
What's the problem?
Tracking points in videos is difficult because traditional methods often require large amounts of manually annotated data and can struggle with changes in the scene, like occlusions (when something blocks the view) or varying lighting. Existing approaches can also be complicated and inefficient, making it hard to track points accurately across longer video clips.
What's the solution?
The researchers developed a self-supervised approach that uses a transformer model to match points across video frames through a technique called contrastive random walks. The model compares all pairs of points between frames to build transition probabilities, then looks for cycle-consistent paths: if a point is tracked forward through the video and then back again, it should return to where it started. This lets the model learn accurate tracking without manual labels or complex multi-stage matching pipelines. They also introduced design choices, such as a data augmentation scheme, that help the model avoid shortcut solutions that could lead to errors.
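The core idea can be illustrated with a small sketch. This is not the paper's implementation (which uses a global matching transformer over learned features); it is a minimal NumPy illustration, assuming simple per-point feature vectors, of how all-pairs similarities define transition matrices and how the forward-then-backward walk yields a cycle-consistency loss:

```python
import numpy as np

def transition(feats_a, feats_b, temp=0.07):
    """All-pairs cosine similarity between point features in two frames,
    turned into a row-stochastic transition matrix via a softmax."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = (a @ b.T) / temp
    sim -= sim.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(sim)
    return e / e.sum(axis=1, keepdims=True)

def cycle_walk_loss(frame_feats, temp=0.07):
    """Random walk forward through the frames and back again.
    The round-trip transition matrix should be close to the identity:
    each point should return to its starting location."""
    n = frame_feats[0].shape[0]
    walk = np.eye(n)
    # Palindrome path: frame 0 -> ... -> last frame -> ... -> frame 0.
    path = list(frame_feats) + list(reversed(frame_feats[:-1]))
    for a, b in zip(path[:-1], path[1:]):
        walk = walk @ transition(a, b, temp)
    # Cross-entropy against identity targets (the contrastive signal:
    # returning home is the positive, all other points are negatives).
    return -np.mean(np.log(np.diag(walk) + 1e-9))
```

With well-separated, stable point features the round-trip matrix is nearly the identity and the loss is close to zero; noisy or ambiguous features raise it. Training minimizes this loss over many videos, so good tracking emerges without any labels.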
Why it matters?
This research matters because it makes point tracking in videos more efficient and accurate without requiring labeled training data, which is useful for applications like video analysis, robotics, and augmented reality. By improving how motion is tracked in video, this method can help us better understand actions and interactions in fields ranging from sports to surveillance.
Abstract
We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to obtain high spatial precision and to obtain a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them. Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.