Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
2025-01-08

Summary
This paper introduces Sa2VA, a new AI model that understands both images and videos. What makes it stand out is that a single model handles many different image and video tasks, such as segmenting the specific objects a user describes and holding conversations about what it sees.
What's the problem?
Current AI models that work with visual content are often limited: they tend to be good at one specific task, or they handle either images or videos but not both. This makes it hard to build a single system that understands visual content as well as humans do.
What's the solution?
The researchers built Sa2VA by combining two existing AI models: SAM-2 (which is good at segmenting and tracking objects in videos) and LLaVA (which is good at understanding language and images). They connect the two by mapping text, images, and videos into one shared token space for the language model; the language model then emits special segmentation tokens that tell SAM-2 which objects to mask. They also created a new dataset called Ref-SAV, with over 72,000 automatically labeled object descriptions in complex video scenes, to help train and evaluate the model.
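To make that wiring more concrete, below is a minimal, runnable sketch of the general idea: the language model marks an object with a special "[SEG]" token, the hidden state at that position is projected into a prompt embedding, and a mask decoder (standing in here for SAM-2) turns that prompt into a mask. All module names, dimensions, and token ids in this sketch are illustrative stand-ins, not the authors' actual code.

import torch
import torch.nn as nn

# Toy sizes; the real models use much larger dimensions (illustration only).
HIDDEN_DIM = 64    # stand-in for the LLM hidden size
PROMPT_DIM = 32    # stand-in for the mask decoder's prompt-embedding size
SEG_TOKEN_ID = 7   # hypothetical id reserved for a special "[SEG]" token

class SegTokenBridge(nn.Module):
    """Projects LLM hidden states at '[SEG]' positions into mask-decoder prompts."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
            nn.GELU(),
            nn.Linear(HIDDEN_DIM, PROMPT_DIM),
        )

    def forward(self, token_ids, hidden_states):
        # hidden_states: (seq_len, HIDDEN_DIM) from the LLM's last layer
        seg_positions = token_ids == SEG_TOKEN_ID       # which answer tokens are "[SEG]"
        return self.proj(hidden_states[seg_positions])  # (num_seg, PROMPT_DIM)

class ToyMaskDecoder(nn.Module):
    """Stand-in for a SAM-2-style decoder: one mask logit map per prompt embedding."""
    def __init__(self):
        super().__init__()
        self.image_proj = nn.Conv2d(3, PROMPT_DIM, kernel_size=1)

    def forward(self, frames, prompts):
        feats = self.image_proj(frames)                       # (B, PROMPT_DIM, H, W)
        return torch.einsum("pc,bchw->bphw", prompts, feats)  # (B, num_prompts, H, W)

# Fake LLM output for one answer that contains a single "[SEG]" token.
token_ids = torch.tensor([5, 9, SEG_TOKEN_ID, 2])
hidden_states = torch.randn(4, HIDDEN_DIM)
frames = torch.randn(1, 3, 128, 128)

bridge, decoder = SegTokenBridge(), ToyMaskDecoder()
prompts = bridge(token_ids, hidden_states)
mask_logits = decoder(frames, prompts)
print(mask_logits.shape)  # torch.Size([1, 1, 128, 128])

In the full system, the mask decoder would also propagate the predicted mask across video frames using SAM-2's tracking memory; the sketch only shows the token-to-prompt-to-mask step.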
Why it matters?
Sa2VA is a significant step toward AI that understands visual content more the way humans do. It could be useful in many real-world applications, such as helping robots navigate, improving video editing software, or building virtual assistants that can see and reason about their surroundings. Because one model handles both images and videos across many tasks, Sa2VA could make it simpler and more efficient to build AI systems for a wide range of uses.
Abstract
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
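For readers curious how referring video object segmentation results like those on Ref-SAV are typically scored, the common practice in this area is region similarity (J, per-frame mask IoU) together with contour accuracy (F). The abstract does not spell the metric out, so the snippet below is only an illustrative sketch of the per-frame IoU part, using toy NumPy masks.

import numpy as np

def mask_iou(pred, gt):
    """Region similarity (J): intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def clip_region_similarity(pred_masks, gt_masks):
    """Mean per-frame IoU for one referred object across a video clip."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))

# Toy example: a 2-frame clip with one referred object.
gt = [np.zeros((4, 4), dtype=np.uint8) for _ in range(2)]
pred = [np.zeros((4, 4), dtype=np.uint8) for _ in range(2)]
gt[0][1:3, 1:3] = 1; pred[0][1:3, 1:3] = 1   # frame 0: perfect match (IoU = 1.0)
gt[1][0:2, 0:2] = 1; pred[1][0:2, 1:3] = 1   # frame 1: partial overlap (IoU = 1/3)
print(round(clip_region_similarity(pred, gt), 3))  # 0.667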