Local All-Pair Correspondence for Point Tracking

Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, Joon-Young Lee

2024-07-23

Summary

This paper introduces LocoTrack, a new model for accurately tracking points across video sequences. It targets a key weakness of previous trackers: they lose points in regions that look similar, such as homogeneous surfaces or repetitive patterns.

What's the problem?

Many existing methods for tracking points in videos rely on local 2D correlation maps, which compare a single query point against a local region in the target frame. When the scene contains repeated patterns or visually uniform areas, many locations produce nearly identical correlation scores, so the tracker cannot tell the true match from look-alikes. As a result, these methods often fail to maintain consistent tracking over time.

What's the solution?

LocoTrack addresses this with local all-pair correspondence: instead of matching one query point against a target region, it compares every location in a small region around the query point against every location in the target region, producing a local 4D correlation volume. This richer signal supports bidirectional matching and smoothness constraints, which resolve the ambiguities that defeat 2D maps. LocoTrack also uses a lightweight correlation encoder to keep computation efficient, and a compact Transformer to integrate long-term temporal information. The combination achieves high accuracy while running nearly six times faster than the previous state of the art.
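To make the idea concrete, here is a minimal sketch of what a local 4D correlation looks like. This is an illustration only, not the paper's implementation: the function name, patch sizes, and cosine-similarity choice are assumptions for the example.

```python
import numpy as np

def local_4d_correlation(query_feats, target_feats):
    """Illustrative all-pair (4D) correlation between two local feature patches.

    query_feats:  (h, w, c) features in a small region around the query point
    target_feats: (H, W, c) features in the target frame's local search region
    Returns a (h, w, H, W) volume: every query-patch location is correlated
    with every target-region location, instead of a single (H, W) 2D map.
    """
    # L2-normalise channels so each dot product is a cosine similarity
    q = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-8)
    t = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + 1e-8)
    # einsum over the channel dimension yields all pairwise similarities
    return np.einsum('ijc,klc->ijkl', q, t)

# Toy example: a 3x3 query patch against a 5x5 target region, 16 channels
rng = np.random.default_rng(0)
corr = local_4d_correlation(rng.standard_normal((3, 3, 16)),
                            rng.standard_normal((5, 5, 16)))
print(corr.shape)  # (3, 3, 5, 5)
```

The extra two dimensions are what let the model check matches in both directions and enforce smoothness: if the query patch's neighbors all agree on a consistent displacement, a spurious look-alike match stands out and can be suppressed.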

Why it matters?

This research matters because it makes AI point tracking both more accurate and substantially faster. Reliable tracking underpins applications such as video surveillance, autonomous driving, and sports analysis, where understanding motion is crucial. By resolving the matching ambiguities that limited previous methods, LocoTrack sets a new standard for point tracking.

Abstract

We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.