UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
Junwei Yu, Trevor Darrell, XuDong Wang
2025-11-18
Summary
This paper introduces UnSAMv2, a new method to improve the Segment Anything Model (SAM) so it can create segmentations – outlines around objects in images – at any level of detail without needing humans to label data.
What's the problem?
The original Segment Anything Model, while powerful, struggles with controlling how detailed the segmentations are. Users often have to manually adjust the results by providing more instructions or choosing from several options, which can be confusing because the same instruction can lead to different possible outlines. Getting enough labeled data to train the model to handle all levels of detail is also very expensive and impractical.
What's the solution?
The researchers built on a previous method called UnSAM, using it to automatically discover many pairs of masks at different levels of detail (granularity) without any human labels. They then introduced a granularity control embedding – a new way to tell the model what level of detail to use – allowing precise, continuous control over the segmentation scale. Only a small amount of unlabeled image data – about 6,000 images – and a tiny number of extra parameters (about 0.02%) added to the original SAM-2 model were needed to achieve these improvements.
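The core idea of conditioning a segmentation model on a continuous granularity value can be sketched as follows. This is a hypothetical illustration, not UnSAMv2's actual implementation: the function name, embedding scheme, and dimensions are assumptions chosen for clarity.

```python
import numpy as np

def granularity_embedding(g: float, dim: int = 256) -> np.ndarray:
    """Map a scalar granularity g in [0, 1] to a fixed-size embedding.

    A sinusoidal encoding (as used for positions in Transformers) is one
    simple way to embed a continuous scalar; UnSAMv2's exact embedding
    may differ.
    """
    half = dim // 2
    # Geometrically spaced frequencies, as in standard positional encodings.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = g * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# A stand-in for a prompt token (e.g. a user's click embedding).
prompt_token = np.zeros(256)

# Adding different granularity embeddings yields distinct conditioning
# vectors: a low g could request fine part-level masks, a high g coarse
# whole-object masks.
fine = prompt_token + granularity_embedding(0.1)
coarse = prompt_token + granularity_embedding(0.9)
```

Because the granularity value is continuous, the model can be asked for any intermediate level of detail rather than choosing from a fixed set of pre-generated masks.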
Why it matters?
This work is important because it shows that you can significantly improve powerful AI models like SAM with a relatively small amount of unlabeled data and a clever learning technique. It unlocks the full potential of these foundation models by allowing them to segment images at any desired level of detail, which is useful for a wide range of tasks like image editing, object recognition, and video analysis, and it does so without the costly need for extensive human labeling.
Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually – by adding more prompts or selecting from pre-generated masks – to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only 6K unlabeled images and 0.02% additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over 11 benchmarks, UnSAMv2 improves NoC90 (5.69 → 4.75), 1-IoU (58.0 → 73.1), and AR1000 (49.6 → 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.