
Efficient Track Anything

Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra

2024-12-03


Summary

This paper presents EfficientTAMs, lightweight models for video object segmentation that make it faster and cheaper to track objects in videos, especially on devices with limited processing power.

What's the problem?

Tracking and segmenting objects in videos is computationally demanding, which makes it hard to run advanced models on devices like smartphones or robots. Existing models such as Segment Anything Model 2 (SAM 2) perform well, but their large image encoders and memory modules require substantial compute and memory, limiting their use in real-world scenarios.

What's the solution?

EfficientTAMs address these issues by replacing SAM 2's large multistage image encoder with a plain, non-hierarchical Vision Transformer (ViT) and by introducing an efficient memory module that cuts the computation needed to carry object information across video frames. The resulting models deliver high-quality segmentation while being much faster and smaller than existing models. The researchers trained EfficientTAMs on the large SA-1B and SA-V datasets and evaluated them on multiple benchmarks, finding that they perform nearly as well as the more complex SAM 2 while running significantly faster with far fewer parameters. A sketch of the memory idea follows below.
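To make the memory idea concrete, here is a minimal PyTorch sketch, not the authors' released code, of a cross-attention step in which the stored memory tokens are first average-pooled to a coarser spatial grid, so the current frame attends to far fewer memory tokens. The class name, pooling factor, and tensor shapes are illustrative assumptions.

```python
# Minimal sketch (assumption, not the EfficientTAM implementation): the current
# frame's tokens attend to a coarsened (average-pooled) view of the memory tokens,
# so attention cost scales with the pooled memory length rather than the full one.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PooledMemoryCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, pool: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = pool  # spatial pooling factor applied to memory keys/values

    def forward(self, frame_tokens, memory_tokens, mem_hw):
        # frame_tokens:  (B, N_frame, C) tokens of the current frame
        # memory_tokens: (B, H*W, C)     spatial tokens stored from past frames
        # mem_hw:        (H, W)          spatial layout of the memory tokens
        B, _, C = memory_tokens.shape
        H, W = mem_hw
        mem_2d = memory_tokens.transpose(1, 2).reshape(B, C, H, W)
        # Coarsen the memory grid: the number of memory tokens shrinks by pool**2.
        mem_coarse = F.avg_pool2d(mem_2d, kernel_size=self.pool)
        mem_coarse = mem_coarse.flatten(2).transpose(1, 2)  # (B, H*W/pool**2, C)
        out, _ = self.attn(frame_tokens, mem_coarse, mem_coarse)
        return out


# Example: a 64x64 memory grid pooled 2x, so attention sees 1024 instead of 4096 tokens.
layer = PooledMemoryCrossAttention(dim=256)
frame = torch.randn(1, 4096, 256)
memory = torch.randn(1, 4096, 256)
out = layer(frame, memory, (64, 64))
print(out.shape)  # torch.Size([1, 4096, 256])
```

The actual EfficientTAM memory module is more carefully designed around the structure of the memory embeddings; the point of this sketch is only to show why coarsening the memory shrinks the dominant cost of per-frame cross-attention.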

Why it matters?

This research is important because it enables effective object tracking in videos on devices that previously couldn't handle such tasks due to hardware limitations. By making these techniques more accessible, EfficientTAMs can be used in numerous applications like mobile apps, robotics, and augmented reality, enhancing how we interact with video content in everyday life.

Abstract

Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive its impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computational complexity of the multistage image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track-anything models that produce high-quality results with low latency and small model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity of both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and the efficient memory module to build EfficientTAMs, and train the models on the SA-1B and SA-V datasets for video object segmentation and track-anything tasks. We evaluate on multiple video segmentation benchmarks, including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with a vanilla ViT performs comparably to the SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over the original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.
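For a frame-by-frame picture of the pipeline the abstract describes, below is a hedged sketch of a per-frame tracking loop: encode the frame with a plain ViT, cross-attend to a bounded bank of past-frame tokens (for example, via the pooled cross-attention sketch above), decode a mask, and update the memory. The component interfaces (encoder, memory_attn, decoder) and the memory window size are assumptions for illustration, not the released EfficientTAM API.

```python
# Hedged sketch of the per-frame loop described in the abstract; component
# interfaces and the memory window size are assumptions for illustration only.
import torch


def track(frames, encoder, memory_attn, decoder, mem_hw, window=7):
    memory_bank = []  # spatial token tensors kept from recent frames
    masks = []
    for frame in frames:                      # frame: (1, 3, H_img, W_img)
        tokens = encoder(frame)               # (1, H*W, C) plain-ViT frame features
        if memory_bank:
            memory = torch.cat(memory_bank, dim=1)          # (1, T*H*W, C)
            hw = (len(memory_bank) * mem_hw[0], mem_hw[1])  # stacked memory layout
            tokens = memory_attn(tokens, memory, hw)        # memory-conditioned tokens
        mask = decoder(tokens)                # (1, 1, H_img, W_img) object mask
        masks.append(mask)
        # A real system would also encode the predicted mask into the memory;
        # here we simply store the (detached) frame tokens and keep a fixed window.
        memory_bank.append(tokens.detach())
        memory_bank = memory_bank[-window:]
    return masks
```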