Multimodal Referring Segmentation: A Survey
Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang
2025-08-04
Summary
This paper surveys multimodal referring segmentation, a technology that lets computers find and outline specific objects in images, videos, or 3D scenes by following referring instructions given as text or audio.
What's the problem?
The challenge is that a computer must understand the visual content and the referring instruction together: the instructions can arrive in different formats (text or speech), and the visual scenes themselves can be complex and varied.
What's the solution?
The paper surveys methods built on convolutional neural networks, transformers, and large language models, examining how they fuse visual data with language or audio instructions to accurately identify and segment the referred objects across different types of scenes.
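At its core, most of these methods score each location in the visual input against an embedding of the referring expression and turn the resulting similarity map into a mask. The toy sketch below illustrates that idea only; the random features, dimensions, and thresholding are illustrative assumptions, not the survey's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not from the survey): a 4x4 grid of
# D-dimensional per-location visual features, and one D-dimensional
# embedding of the referring expression ("the red cup on the left").
D = 8
visual_feats = rng.normal(size=(4, 4, D))  # features from a visual encoder
text_embed = rng.normal(size=(D,))         # features from a text/audio encoder

def refer_segment(feats, query, threshold=0.0):
    """Score every spatial location against the query embedding,
    then threshold the similarity map into a binary mask."""
    scores = feats @ query                 # (H, W) cross-modal similarity map
    return (scores > threshold).astype(np.uint8)

mask = refer_segment(visual_feats, text_embed)
print(mask.shape)  # (4, 4) binary mask over the toy grid
```

Real systems replace the random arrays with learned encoders and far richer fusion (cross-attention, language-conditioned decoding), but the input/output contract, visual content plus an instruction in, a per-location mask out, is the same.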
Why it matters?
This matters because it makes applications such as image editing, video analysis, robotics, and autonomous driving smarter and more user-friendly: users can pick out objects in complex environments simply by describing them in natural language or speech.
Abstract
A survey of multimodal referring segmentation techniques, covering advancements in convolutional neural networks, transformers, and large language models for segmenting objects in images, videos, and 3D scenes based on text or audio instructions.