OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang

2025-12-30

Summary

This paper introduces OmniAgent, a new system that understands audio and video together far better than current AI models by actively investigating videos with specialized tools instead of passively describing them.

What's the problem?

Existing AI models that handle both audio and video often struggle to truly connect what they 'hear' and 'see'. They miss the subtle relationships between sounds and visuals, and they have trouble pinpointing exactly *when* and *where* important things happen in a video based on the audio. Many also rely on densely describing every frame of a video, which is inefficient and still misses the details that matter.

What's the solution?

The researchers created OmniAgent, which doesn't just passively watch and listen. Instead, it actively 'investigates' using different tools, guided by the audio. Think of it like a detective following clues: the audio tells it when to look and what to focus on. It plans its actions dynamically, invoking tools only when they are needed, and it starts with a broad understanding based on sound before zooming in on the relevant moments for visual detail. This 'coarse-to-fine' approach, driven by audio, is key to its success.
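To make the idea concrete, here is a minimal sketch of what such a coarse-to-fine, audio-guided loop could look like. The tool functions (detect_sound_events, caption_frames), the placeholder data, and the keyword matching are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str      # e.g. "glass breaking"
    start_s: float  # event start time in seconds
    end_s: float    # event end time in seconds


def detect_sound_events(audio_path: str) -> list[SoundEvent]:
    """Hypothetical audio tool: coarse sound events with timestamps.

    A real implementation would run an audio-tagging model; placeholder
    events are returned here so the sketch runs end to end.
    """
    return [
        SoundEvent("dog barking", 3.0, 5.5),
        SoundEvent("glass breaking", 41.2, 42.0),
    ]


def caption_frames(video_path: str, start_s: float, end_s: float) -> str:
    """Hypothetical vision tool: describe only frames inside a time window."""
    return f"placeholder caption for {start_s:.1f}-{end_s:.1f}s"


def answer_question(video_path: str, audio_path: str, question: str) -> str:
    # Coarse stage: listen first to find out *when* something relevant happens.
    events = detect_sound_events(audio_path)
    relevant = [e for e in events if e.label in question.lower()]

    # Fine stage: inspect the video only around the audio-localized moments,
    # instead of densely captioning every frame.
    evidence = []
    for event in relevant or events:
        caption = caption_frames(video_path, event.start_s, event.end_s)
        evidence.append(
            f"[{event.start_s:.1f}-{event.end_s:.1f}s] {event.label}: {caption}"
        )

    # In the real agent, an LLM would reason over this evidence (and possibly
    # call more tools) before answering; here we simply return the evidence.
    return "\n".join(evidence)


if __name__ == "__main__":
    print(answer_question("clip.mp4", "clip.wav",
                          "Why did the glass breaking sound happen?"))
```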

Why it matters?

OmniAgent represents a significant improvement in how AI understands the world through multiple senses. It beats leading open-source and proprietary models by 10-20% accuracy on audio-video understanding benchmarks. This is important because it moves us closer to AI systems that can truly 'see' and 'hear' like humans, which has applications in areas like robotics, video analysis, and assistive technology.

Abstract

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10%-20% accuracy.
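As an illustration of the on-demand tool orchestration the abstract describes, the sketch below shows an agent loop in which a planner chooses the next tool to invoke and stops once enough evidence has been gathered. Everything in it, from the tool registry to the plan_next_step helper, is a hypothetical stand-in rather than OmniAgent's actual interface; in the real system an LLM would make these planning decisions.

```python
from typing import Callable

# Registry of specialized perception tools (all hypothetical stand-ins).
TOOLS: dict[str, Callable[[str], str]] = {
    "audio_events":  lambda query: "placeholder: timestamped sound events",
    "frame_caption": lambda query: "placeholder: captions of selected frames",
    "ocr":           lambda query: "placeholder: on-screen text",
}


def plan_next_step(question: str, evidence: list[str]) -> str | None:
    """Hypothetical planner: pick the next tool, or None when done.

    In the real agent this decision would come from an LLM reading the
    question and the evidence gathered so far; a fixed ordering stands in
    for it here.
    """
    order = ["audio_events", "frame_caption", "ocr"]
    return order[len(evidence)] if len(evidence) < len(order) else None


def run_agent(question: str, max_steps: int = 5) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        tool_name = plan_next_step(question, evidence)
        if tool_name is None:  # planner decides no further tools are needed
            break
        evidence.append(f"{tool_name}: {TOOLS[tool_name](question)}")
    # The final answer would normally be generated by the LLM from the
    # collected evidence; the evidence itself is returned here.
    return " | ".join(evidence)


if __name__ == "__main__":
    print(run_agent("What made the loud crash at the start of the video?"))
```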