3AM: Segment Anything with Geometric Consistency in Videos

Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu

2026-01-14

Summary

This paper introduces 3AM, a new method for video object segmentation, the task of identifying and outlining specific objects in a video. It builds on SAM2, an existing powerful segmentation model, making it more reliable when objects move or are seen from very different viewpoints.

What's the problem?

Current video object segmentation methods often struggle when an object's viewpoint changes drastically. Models like SAM2 are good at recognizing objects based on how they *look*, but this isn't enough when the object rotates or is seen from a completely different angle. Existing 3D segmentation methods are more viewpoint-consistent, but they require extra information like camera positions and depth data, which isn't always available or easy to obtain, and they take a lot of processing power.

What's the solution?

The researchers developed 3AM, which enhances SAM2 by adding a '3D awareness' component. They take features from another model, MUSt3R, which captures the 3D geometry of a scene, and fuse them with SAM2's visual recognition features through a lightweight 'Feature Merger'. This lets the model recognize objects based on *both* their appearance and their position in 3D space. They also developed a way to select video frames that show spatially consistent parts of the object, making these 3D relationships easier to learn. Importantly, 3AM needs only regular (RGB) video at inference; no camera poses or other extra data are required.
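
To make the fusion idea concrete, here is a minimal PyTorch sketch of how multi-level geometric features might be merged with appearance features. The class name `FeatureMerger`, the per-level linear projections, and the additive fusion rule are illustrative assumptions; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class FeatureMerger(nn.Module):
    """Illustrative feature merger (names and fusion rule assumed).

    Projects multi-level geometric features (e.g., from a 3D-aware
    encoder such as MUSt3R) into the appearance-feature dimension and
    adds them to SAM2-style appearance tokens.
    """

    def __init__(self, geo_dims, app_dim):
        super().__init__()
        # One linear projection per geometric feature level.
        self.projections = nn.ModuleList(
            [nn.Linear(d, app_dim) for d in geo_dims]
        )
        self.norm = nn.LayerNorm(app_dim)

    def forward(self, geo_feats, app_feat):
        # geo_feats: list of (B, N, d_i) tokens, one per encoder level.
        # app_feat:  (B, N, app_dim) appearance tokens.
        fused = app_feat
        for proj, feat in zip(self.projections, geo_feats):
            fused = fused + proj(feat)  # additive fusion (an assumption)
        return self.norm(fused)

# Toy usage with random tensors standing in for real features.
merger = FeatureMerger(geo_dims=[384, 768], app_dim=256)
geo = [torch.randn(1, 1024, 384), torch.randn(1, 1024, 768)]
app = torch.randn(1, 1024, 256)
out = merger(geo, app)  # (1, 1024, 256) geometry-aware tokens
```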

Why it matters?

This work matters because 3AM substantially outperforms existing methods on challenging video datasets, especially when objects undergo large viewpoint changes. It does so without complex setups or extra inputs such as camera poses, making it more practical for real-world applications like robotics, augmented reality, and video editing.

Abstract

Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/
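
As a companion to the abstract's field-of-view aware sampling strategy, here is a minimal sketch of one way such a selection could work, keeping only frames whose visible object region overlaps a reference frame's region. The function `fov_aware_sample`, the IoU criterion, and the threshold are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def fov_aware_sample(masks, ref_idx=0, min_overlap=0.5, k=8):
    """Keep frames whose visible object region overlaps the reference
    frame's region by at least `min_overlap` IoU, then return up to
    `k` frame indices with the highest overlap.

    masks: (T, H, W) boolean per-frame object visibility masks.
    """
    ref = masks[ref_idx].float()
    ious = torch.empty(masks.shape[0])
    for t in range(masks.shape[0]):
        m = masks[t].float()
        inter = (ref * m).sum()
        union = ((ref + m) > 0).float().sum().clamp(min=1.0)
        ious[t] = inter / union
    keep = torch.nonzero(ious >= min_overlap).flatten()
    # Rank the surviving frames by overlap and take the top k.
    order = ious[keep].argsort(descending=True)
    return keep[order[:k]]

# Toy usage: 12 random masks for a 64x64 frame.
masks = torch.rand(12, 64, 64) > 0.5
frames = fov_aware_sample(masks)
```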