Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing Lyu, Chun Yuan, Fengyun Rao
2026-02-24
Summary
This paper tackles the problem of finding the same object in a video when the camera viewpoint changes dramatically: matching between a first-person (egocentric, like a GoPro view) and a third-person (exocentric, like a regular camera view) perspective, in either direction.
What's the problem?
It's really hard for computers to recognize the same object when seen from completely different angles. Imagine trying to spot the cup someone is holding in footage from their head-mounted camera when all you have is its outline from a camera across the room: the same object can look completely different after such a drastic viewpoint shift. Existing methods struggle with these big changes, and they often need a lot of labeled correspondence data, which is expensive and time-consuming to create.
What's the solution?
The researchers propose a method built around a 'query mask': you hand the computer the shape of the object in one view and ask it to find the matching object in the other view. This mask is encoded into a latent representation that guides a binary segmentation of the target video. To make this work regardless of viewpoint, they add a 'cycle-consistency' objective: the computer not only predicts the object's mask in the new view, but must also project that prediction back to reconstruct the original query mask. This forces it to learn a view-invariant understanding of the object, rather than just memorizing how it looks from one specific angle. Because the cycle loss needs no ground-truth labels, it also enables 'test-time training' (TTT): the model keeps optimizing the same objective on each test sample while it's actually being used, improving its performance on the fly.
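The cycle-consistency idea can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `forward_model` and `backward_model` are hypothetical stand-ins for the cross-view segmentation networks, and a horizontal flip plays the role of a viewpoint change.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss between two mask arrays with values in [0, 1].
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def cycle_consistency_loss(query_mask, forward_model, backward_model):
    # Forward: predict the object's mask in the target view.
    target_mask = forward_model(query_mask)
    # Backward: project the prediction back to the source view.
    reconstructed = backward_model(target_mask)
    # Self-supervised signal: the reconstruction should match the query,
    # so no ground-truth mask in the target view is needed.
    return dice_loss(reconstructed, query_mask)

# Toy stand-in: flipping twice returns the original mask, so the cycle
# closes perfectly and the loss is zero.
flip = lambda m: m[:, ::-1]
query = np.zeros((8, 8))
query[2:5, 1:4] = 1.0
print(cycle_consistency_loss(query, flip, flip))  # → 0.0
```

The key property is that the supervision comes entirely from the query mask itself, which is why the same objective can keep being minimized at test time.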
Why it matters?
This research is important because it improves the ability of computers to understand videos from different perspectives. This has applications in areas like robotics (a robot needs to understand what *you* are pointing at, even if it sees the world differently), augmented reality (accurately tracking objects in a scene), and video analysis (understanding actions and interactions in complex environments). The fact that their method doesn't require a ton of labeled data and can even improve itself during use makes it particularly practical.
Abstract
We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
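As a rough illustration of how the cycle objective enables test-time training, the toy example below adapts a single parameter at inference by descending the self-supervised cycle loss on one unlabeled sample. The 1-D `shift` "model", the fixed backward projection, and the finite-difference gradient are all simplifications invented for this sketch; a real model would update network weights by backpropagation.

```python
import numpy as np

def shift(mask, t):
    # Fractional 1-D shift via linear interpolation (toy "view transform").
    idx = np.arange(len(mask)) - t
    return np.interp(idx, np.arange(len(mask)), mask)

query = np.zeros(16)
query[4:7] = 1.0  # a 1-D "object mask"

def cycle_loss(theta):
    # Learnable forward shift theta, fixed backward projection by -2.
    # The cycle closes, and the loss vanishes, exactly when theta == 2.
    pred = shift(query, theta)
    recon = shift(pred, -2.0)
    return ((recon - query) ** 2).mean()

# Test-time training: gradient descent on the unlabeled test sample,
# using a finite-difference gradient for this scalar toy parameter.
theta, lr, eps = 0.0, 1.0, 1e-3
for _ in range(200):
    grad = (cycle_loss(theta + eps) - cycle_loss(theta - eps)) / (2 * eps)
    theta -= lr * grad
print(round(theta, 2))  # → 2.0
```

No label ever enters the loop: the parameter is recovered purely because a wrong prediction fails to reconstruct the query when cycled back, which mirrors how the paper's TTT strategy can refine predictions at inference.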