SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
2024-08-02

Summary
This paper presents SAM 2, a foundation model for promptable segmentation of objects in both images and videos. It pairs a streaming transformer architecture with a new large-scale video segmentation dataset to make segmentation faster and more accurate than prior approaches.
What's the problem?
Existing models for segmenting objects in video struggle with motion, lighting changes, and occlusions (when one object blocks another). These factors cause tracking errors, especially in low-quality footage or when objects move quickly. In addition, prior interactive methods often required many user corrections to reach good results, making them inefficient to use.
What's the solution?
To address these challenges, the authors developed SAM 2, a simple transformer architecture with streaming memory that processes video in real time. A data engine, in which annotators interact with the model and correct its outputs, was used to collect the largest video segmentation dataset to date, and the model is trained on this data. In video segmentation, SAM 2 achieves better accuracy while using three times fewer interactions than earlier methods; in image segmentation, it is more accurate and six times faster than its predecessor, the original Segment Anything Model (SAM). Its streaming memory stores information from previous frames and user prompts, enabling the model to track objects consistently across a video.
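The streaming memory is the key architectural difference from the original SAM. The following is a minimal, hypothetical PyTorch sketch of that idea only, not the released SAM 2 code: the class names (MemoryBank, StreamingSegmenter) and the stand-in linear encoder and mask head are illustrative assumptions, while the real model uses a full image encoder, memory attention, a prompt encoder, a mask decoder, and a memory encoder.

```python
# Minimal sketch of the streaming-memory idea behind SAM 2 (illustrative only,
# not the authors' implementation; all module names here are hypothetical).
import collections
import torch
import torch.nn as nn

class MemoryBank:
    """Keeps features from the most recent frames (a rolling memory)."""
    def __init__(self, max_frames: int = 6):
        self.frames = collections.deque(maxlen=max_frames)

    def add(self, frame_feats: torch.Tensor):
        self.frames.append(frame_feats)

    def as_tensor(self) -> torch.Tensor:
        # Concatenate stored frame features into (1, num_memory_tokens, dim).
        return torch.cat(list(self.frames), dim=1)

class StreamingSegmenter(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Linear(3 * 16 * 16, dim)   # stand-in per-patch image encoder
        self.memory_attention = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Linear(dim, 1)            # stand-in mask decoder

    def forward(self, frame_patches: torch.Tensor, memory: MemoryBank) -> torch.Tensor:
        feats = self.encoder(frame_patches)           # (1, num_patches, dim)
        if len(memory.frames) > 0:
            mem = memory.as_tensor()
            # Condition the current frame on memories of previous frames.
            feats, _ = self.memory_attention(feats, mem, mem)
        memory.add(feats.detach())                    # store this frame for the future
        return self.mask_head(feats).squeeze(-1)      # per-patch mask logits

# Usage: process a video one frame at a time (streaming), never revisiting old frames.
model = StreamingSegmenter()
memory = MemoryBank()
video = torch.randn(10, 1, 64, 3 * 16 * 16)           # 10 frames, 64 patches each
for frame in video:
    mask_logits = model(frame, memory)
```

Because each frame only attends to a small, fixed-size memory rather than the whole clip, the cost per frame stays constant, which is what makes real-time streaming inference possible.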
Why it matters?
This research is significant because it enhances the capabilities of AI in understanding and processing visual content. By improving the speed and accuracy of object segmentation in images and videos, SAM 2 can be applied in various fields such as video editing, medical imaging, and autonomous vehicles. This advancement could lead to more efficient workflows and better user experiences in applications that rely on visual data.
Abstract
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model, the dataset and an interactive demo.