Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang
2025-07-31
Summary
This paper introduces OmniAVS, a new benchmark for referring audio-visual segmentation, and OISA, a segmentation assistant built on a multimodal large language model. Together they advance audio-visual segmentation, the task of identifying and segmenting the sound-producing objects in a video, toward handling referring expressions that mix text, audio, and visual information and that require the reasoning abilities of large multimodal models.
What's the problem?
Current referring audio-visual segmentation systems struggle with expressions that combine multiple modalities, such as text, audio, and visuals, and they lack the deeper reasoning needed to accurately identify and segment the sound-producing objects in a video.
What's the solution?
OmniAVS broadens the task with more complex instructions and referring expressions that span different modalities, while OISA uses a powerful multimodal large language model (MLLM) to reason about the audio-visual content and guide the segmentation, making the process smarter and more flexible.
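To make the MLLM-guided idea concrete, below is a minimal toy sketch, not the paper's actual OISA architecture: features from video, audio, and text are fused together with a learned segmentation query by a small Transformer that stands in for the MLLM, and the query's output embedding is decoded into per-frame masks. All module names, dimensions, and the Transformer stand-in are illustrative assumptions.

```python
import torch
import torch.nn as nn

D = 64                 # shared feature dimension (arbitrary for the toy)
T, H, W = 4, 32, 32    # number of frames and mask resolution

class ToyOmniSegmenter(nn.Module):
    """Toy sketch of query-based, reasoning-guided segmentation (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.visual_proj = nn.Linear(3 * 8 * 8, D)   # patchified frame -> feature
        self.audio_proj = nn.Linear(128, D)          # spectrogram slice -> feature
        self.text_proj = nn.Embedding(1000, D)       # expression token ids -> feature
        # Stand-in for the MLLM: a small Transformer encoder doing joint reasoning.
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
        # Learned query playing the role of a special segmentation token.
        self.seg_query = nn.Parameter(torch.randn(1, 1, D))
        self.mask_decoder = nn.Linear(D, H * W)      # query embedding -> flat mask logits

    def forward(self, frames, audio, text_ids):
        # Encode each modality into a shared token space.
        v = self.visual_proj(frames.flatten(2))      # (B, T, D)
        a = self.audio_proj(audio)                   # (B, T, D)
        t = self.text_proj(text_ids)                 # (B, L, D)
        seg = self.seg_query.expand(v.size(0), -1, -1)
        # Joint reasoning over vision, audio, the referring expression, and the query.
        out = self.reasoner(torch.cat([v, a, t, seg], dim=1))
        seg_embedding = out[:, -1]                   # output state of the segmentation query
        # Decode a mask conditioned on the query embedding, broadcast over frames.
        masks = self.mask_decoder(seg_embedding).view(-1, 1, H, W).expand(-1, T, H, W)
        return torch.sigmoid(masks)                  # (B, T, H, W) soft masks

# Usage with random inputs:
model = ToyOmniSegmenter()
frames = torch.randn(2, T, 3, 8 * 8)          # toy "patchified" frames
audio = torch.randn(2, T, 128)                # toy spectrogram features
text_ids = torch.randint(0, 1000, (2, 10))    # toy referring expression
print(model(frames, audio, text_ids).shape)   # torch.Size([2, 4, 32, 32])
```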
Why it matters?
This matters because audio-visual segmentation with reasoning helps AI understand and interact with the world more effectively, improving applications such as video editing, surveillance, and human-computer interaction, where knowing which objects in a scene are making sounds is crucial.
Abstract
Omnimodal Referring Audio-Visual Segmentation (OmniAVS) and the Omnimodal Instructed Segmentation Assistant (OISA) advance audio-visual segmentation by integrating complex multimodal referring expressions and leveraging multimodal large language models (MLLMs) for reasoning-based segmentation.