Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
2024-07-17

Summary
This paper introduces Ref-AVS, a new task that segments objects in audio-visual scenes based on natural language expressions enriched with both audio and visual cues.
What's the problem?
Most existing reference segmentation methods operate on silent visual scenes: they consider only visual information and ignore the sounds associated with the objects. This limits their ability to handle scenes where audio matters, such as identifying which object in a video is producing a sound. As a result, these approaches cannot exploit the rich, complementary information carried jointly by audio and visual signals.
What's the solution?
To address this, the authors construct a new benchmark, Ref-AVS, which provides pixel-level annotations for objects described by expressions containing multimodal cues, including audio and visual descriptions. They also propose a method that leverages these cues to guide segmentation more precisely: the model processes audio alongside visual data, improving its ability to identify and segment the objects referred to by natural language prompts.
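To make the idea of multimodal-cue guidance concrete, here is a minimal sketch of one way audio, visual, and text embeddings could be fused to predict a segmentation mask. This is not the authors' Ref-AVS architecture; the module name, feature dimensions, and the broadcast-and-concatenate fusion are assumptions chosen purely for illustration.

```python
# Illustrative only: a naive fusion of visual features with pooled audio and
# text cues to produce a per-pixel mask. Not the Ref-AVS method itself.
import torch
import torch.nn as nn

class NaiveMultimodalSegmenter(nn.Module):  # hypothetical module
    def __init__(self, vis_dim=256, aud_dim=128, txt_dim=512, hidden=256):
        super().__init__()
        # Project audio and text cues into a shared space with visual features.
        self.aud_proj = nn.Linear(aud_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.vis_proj = nn.Conv2d(vis_dim, hidden, kernel_size=1)
        # Fuse per-pixel visual features with the broadcast cue embedding.
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden * 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # per-pixel mask logits
        )

    def forward(self, vis_feat, aud_emb, txt_emb):
        # vis_feat: [B, vis_dim, H, W] frame features from a visual backbone
        # aud_emb:  [B, aud_dim]       pooled audio embedding for the clip
        # txt_emb:  [B, txt_dim]       pooled embedding of the expression
        v = self.vis_proj(vis_feat)                             # [B, hidden, H, W]
        cue = self.aud_proj(aud_emb) + self.txt_proj(txt_emb)   # [B, hidden]
        cue = cue[:, :, None, None].expand_as(v)                # broadcast over H x W
        logits = self.fuse(torch.cat([v, cue], dim=1))          # [B, 1, H, W]
        return torch.sigmoid(logits)                            # soft mask

# Example with random tensors standing in for real encoder outputs.
model = NaiveMultimodalSegmenter()
mask = model(torch.randn(2, 256, 28, 28), torch.randn(2, 128), torch.randn(2, 512))
print(mask.shape)  # torch.Size([2, 1, 28, 28])
```

The sketch only shows the general pattern the paper targets: the expression and the audio stream jointly condition where the model looks in the visual frame, rather than the text alone.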
Why it matters?
This research is significant because it addresses the need for more comprehensive models that can interpret real-world scenarios where multiple senses are involved. By integrating audio and visual information, Ref-AVS can improve how machines understand complex environments, which has applications in areas like robotics, video analysis, and interactive media. This could lead to better AI systems that can assist in tasks requiring a deeper understanding of both sight and sound.
Abstract
Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. Dataset is available at https://gewu-lab.github.io/Ref-AVS.