SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin
2025-11-21
Summary
This paper focuses on improving the ability of computer vision systems to accurately identify and track objects within surgical videos, which is important for assisting surgeons during operations.
What's the problem?
Current state-of-the-art models, like Segment Anything Model 2 (SAM2), are good at identifying objects in general imagery, but they struggle in surgery: surgical videos look very different from typical images (a domain gap), and objects must be tracked reliably across a long procedure. Existing datasets are also too small and too sparsely annotated to properly train and test these models for surgical use.
What's the solution?
The researchers created a new, large dataset called SA-SV, containing 61,000 surgical video frames with 1,600 detailed instance annotations of instruments and tissues across eight procedure types. They then built a new model, SAM2S, which improves upon SAM2 with a trainable 'memory' (called DiveMem) to help it remember objects over long videos, temporal semantic learning to better understand surgical instruments, and a method to cope with inconsistencies in annotations coming from different sources. Essentially, they made SAM2 more 'surgery-aware'.
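To give a feel for why a 'diverse' memory helps long-term tracking, here is a minimal sketch of a fixed-capacity memory bank that evicts the most redundant entry instead of the oldest one. This is illustrative only: the paper's DiveMem is a trainable mechanism inside SAM2's architecture, whereas the class, feature vectors, and cosine-similarity heuristic below are assumptions made up for this sketch.

```python
import math

class DiverseMemoryBank:
    """Fixed-capacity memory that favors appearance-diverse entries.

    Hypothetical sketch: SAM2's real memory stores learned spatial feature
    maps, and DiveMem's selection is trained end-to-end. Here we keep plain
    feature vectors and use a simple cosine-similarity eviction rule.
    """

    def __init__(self, capacity=6):
        self.capacity = capacity
        self.entries = []  # list of (frame_index, feature_vector)

    @staticmethod
    def _cosine(a, b):
        # Cosine similarity between two feature vectors.
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-8)

    def add(self, frame_index, feat):
        if len(self.entries) < self.capacity:
            self.entries.append((frame_index, feat))
            return
        # Evict the stored entry most similar to the incoming frame, so the
        # bank keeps diverse references to the object's appearance. A plain
        # FIFO memory would instead keep only the most recent frames and
        # forget how the object looked earlier in a long surgery.
        sims = [self._cosine(feat, f) for _, f in self.entries]
        most_redundant = max(range(len(sims)), key=sims.__getitem__)
        self.entries[most_redundant] = (frame_index, feat)
```

With capacity 2, adding two dissimilar frames and then a third frame similar to the first replaces the first, keeping one reference per distinct appearance rather than just the two latest frames.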
Why it matters?
This work is important because better surgical video segmentation can lead to more precise computer-assisted surgery, potentially improving patient outcomes. The new dataset and model provide a significant step forward in this field, and the improvements in tracking and accuracy could make these systems more reliable and useful in real-world operating rooms.
Abstract
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with fine-tuned SAM2 improving by 12.99 average J&F over vanilla SAM2. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
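The J&F score quoted in the abstract is the standard VOS metric: the mean of region similarity J (intersection-over-union of the predicted and ground-truth masks) and contour accuracy F (an F-measure over boundary pixels). The sketch below computes a simplified version on small binary masks; the official benchmark implementation matches boundaries with a spatial tolerance, which this toy version omits.

```python
def region_j(pred, gt):
    # Region similarity J: intersection-over-union of two binary masks.
    inter = sum(p & g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    union = sum(p | g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    return inter / union if union else 1.0

def boundary(mask):
    # Pixels of the mask that touch the background or the image border
    # (4-neighbourhood), i.e. a simple object contour.
    h, w = len(mask), len(mask[0])
    pts = set()
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue
            if (i == 0 or j == 0 or i == h - 1 or j == w - 1
                    or not (mask[i - 1][j] and mask[i + 1][j]
                            and mask[i][j - 1] and mask[i][j + 1])):
                pts.add((i, j))
    return pts

def contour_f(pred, gt):
    # Contour accuracy F: F-measure between boundary pixel sets.
    # Simplified to exact pixel matches (no distance tolerance).
    bp, bg = boundary(pred), boundary(gt)
    if not bp and not bg:
        return 1.0
    tp = len(bp & bg)
    precision = tp / len(bp) if bp else 0.0
    recall = tp / len(bg) if bg else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def j_and_f(pred, gt):
    # The reported score: mean of region J and contour F.
    return (region_j(pred, gt) + contour_f(pred, gt)) / 2
```

A perfect prediction scores 1.0 (reported as 100 in the abstract's scale); missing half of a two-pixel object lowers both the region and contour terms.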