
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

2024-07-01


Summary

This paper talks about EVF-SAM, a new method designed to extend the Segment Anything Model (SAM) so that it can understand text prompts. It fuses visual and textual information early on, helping the model accurately segment images based on what the user describes.

What's the problem?

While SAM is great at segmenting images when given visual prompts (like points or boxes), how well it can handle text prompts has not been fully explored. As a result, users who want to describe what they want segmented in words may not get accurate results, which makes SAM less flexible for many applications.

What's the solution?

To solve this problem, the authors introduced EVF-SAM, which combines image and text prompts to improve segmentation accuracy. It uses a pre-trained vision-language model that processes visual and textual inputs jointly. Their experiments showed that multimodal prompts (both image and text) together with an early fusion approach significantly enhance SAM's ability to segment objects based on text descriptions. The new method also uses about 82% fewer parameters than previous SAM approaches built on large multimodal models, making it much more efficient.
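
To make the overall architecture concrete, here is a minimal PyTorch-style sketch of the idea described above: an early-fusion vision-language encoder (e.g. BEIT-3-style) turns the image and the text description into a single "referring prompt" embedding, which then prompts SAM in place of points or boxes. The module names, dimensions, and the SAM component interfaces (loosely modeled on the open-source segment-anything prompt encoder and mask decoder) are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class EVFSAMSketch(nn.Module):
    """Minimal sketch (not the official code): an early-fusion
    vision-language encoder produces a referring prompt embedding
    that conditions SAM's mask decoder."""

    def __init__(self, vlm_encoder, sam_image_encoder, sam_prompt_encoder,
                 sam_mask_decoder, vlm_dim=1024, prompt_dim=256):
        super().__init__()
        self.vlm_encoder = vlm_encoder            # early-fusion VLM, e.g. BEIT-3-style (assumed interface)
        self.sam_image_encoder = sam_image_encoder
        self.sam_prompt_encoder = sam_prompt_encoder
        self.sam_mask_decoder = sam_mask_decoder
        # Project the fused multimodal feature into SAM's prompt embedding space.
        self.prompt_proj = nn.Linear(vlm_dim, prompt_dim)

    def forward(self, image, vlm_image, text_tokens):
        # 1) Early fusion: image patches and text tokens are encoded jointly,
        #    so the text can attend to visual content from the first layers.
        #    (The VLM typically gets its own, lower-resolution view of the image.)
        fused = self.vlm_encoder(vlm_image, text_tokens)          # (B, vlm_dim)
        referring_prompt = self.prompt_proj(fused).unsqueeze(1)   # (B, 1, prompt_dim)

        # 2) Standard SAM pipeline, but prompted with the fused embedding
        #    instead of points or boxes.
        image_embeddings = self.sam_image_encoder(image)
        _, dense = self.sam_prompt_encoder(points=None, boxes=None, masks=None)
        masks, iou_pred = self.sam_mask_decoder(
            image_embeddings=image_embeddings,
            image_pe=self.sam_prompt_encoder.get_dense_pe(),
            sparse_prompt_embeddings=referring_prompt,  # text-derived prompt replaces points/boxes
            dense_prompt_embeddings=dense,
            multimask_output=False,
        )
        return masks, iou_pred
```

The design choice this sketch highlights is that fusion happens inside the vision-language encoder (early fusion), rather than encoding image and text separately and combining them afterwards, which is what the paper finds works best for prompting SAM.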

Why it matters?

This research is important because it expands the capabilities of SAM by allowing it to respond to text prompts effectively. This improvement can make SAM more useful in real-world scenarios where users want to interact with the model using natural language, such as in applications for image editing, data labeling, or any task that requires precise object segmentation based on descriptions.

Abstract

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.