The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee

2025-12-09

Summary

This paper explores the major differences between two versions of an image segmentation model, SAM2 and SAM3, and explains why skills learned from using SAM2 don't automatically carry over to SAM3.

What's the problem?

The newest version, SAM3, works very differently from the older SAM2. SAM2 focuses on precisely outlining objects based on where you *show* it what to segment, such as clicking on a point or drawing a box. SAM3, however, understands what you *tell* it to segment using words, like 'cat' or 'car'. Because the two models operate on fundamentally different principles, the techniques used to get the best results from SAM2 don't translate to SAM3.
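The contrast between the two prompting styles can be sketched in code. The classes and function below are purely illustrative, not the real SAM2 or SAM3 APIs; they only show that a spatial prompt carries coordinates (*where*) while a concept prompt carries a label (*what*):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SpatialPrompt:
    """SAM2-style prompt: tells the model *where* to segment."""
    points: List[Tuple[int, int]]                  # e.g. clicks on the object
    box: Optional[Tuple[int, int, int, int]] = None  # optional bounding box

@dataclass
class ConceptPrompt:
    """SAM3-style prompt: tells the model *what* to segment."""
    text: str                                      # e.g. "cat"

def describe(prompt) -> str:
    """Report which kind of information each prompt type carries."""
    if isinstance(prompt, SpatialPrompt):
        return f"geometric cue at {prompt.points}"
    if isinstance(prompt, ConceptPrompt):
        return f"semantic concept '{prompt.text}'"
    raise TypeError("unknown prompt type")

print(describe(SpatialPrompt(points=[(120, 80)])))  # geometric cue at [(120, 80)]
print(describe(ConceptPrompt(text="cat")))          # semantic concept 'cat'
```

Because the two prompt types carry incompatible information, tricks for placing good clicks and boxes (SAM2 expertise) say nothing about choosing good text concepts (SAM3 usage).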

What's the solution?

The researchers broke the differences down into five key areas. First, they compared how each model understands instructions: spatial cues versus text descriptions. Second, they detailed the internal structure of each model, showing that SAM3 is much more complex, with components for processing both images and language. Third, they looked at the data used to train each model, noting that SAM3 was trained on data with more conceptual labels. Fourth, they explained why the best training settings for SAM2 don't work for SAM3. Finally, they compared how each model is tested and which metrics are used to measure success, highlighting a shift from measuring geometric accuracy to measuring semantic understanding.

Why it matters?

This work shows that SAM3 isn't just an improvement on SAM2; it's a whole new type of segmentation model. It marks a shift towards models that understand concepts and segment images based on what things *are*, rather than just *where* they are. This research helps point the way for future development in this area of artificial intelligence.

Abstract

This paper investigates the fundamental discontinuity between the latest two Segment Anything Models, SAM2 and SAM3, and explains why expertise in SAM2's prompt-based segmentation does not transfer to SAM3's multimodal, concept-driven paradigm. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting the spatial prompt semantics of SAM2 with the multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing the pure vision-temporal design of SAM2 versus SAM3's integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity handling via Mixture-of-Experts; (3) Dataset and Annotation Differences, contrasting SA-V video masks with SAM3's multimodal, concept-annotated corpora; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.
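The geometric IoU metric that the abstract contrasts with semantic, open-vocabulary evaluation can be sketched in a few lines. This is a minimal plain-Python illustration of mask IoU, not the paper's evaluation code:

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-Union of two binary masks given as nested lists.

    IoU = |A ∩ B| / |A ∪ B|: a purely geometric score that measures
    pixel overlap and ignores what the segmented object actually is.
    """
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a & b   # pixel set in both masks
            union += a | b   # pixel set in either mask
    return inter / union if union else 1.0  # two empty masks: perfect match

pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
gt   = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
print(mask_iou(pred, gt))  # 2 overlapping pixels / 6 covered pixels ≈ 0.333
```

A high IoU here only says the predicted region overlaps the ground truth well; evaluating open-vocabulary segmentation additionally requires checking that the mask corresponds to the right *concept*, which is why SAM3-style evaluation cannot rely on IoU alone.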