See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

2025-12-10

Summary

This paper introduces a new way to test how well artificial intelligence can understand videos, specifically focusing on connecting what people *say* with *who* is saying it and *when* they say it.

What's the problem?

Current tests for AI understanding of videos don't really challenge the AI to deeply understand speech. Many tasks can be solved just by looking at the visuals, or they only check if the AI generally understands the speech, not the specifics of who said what at a particular moment. This means we don't know if AI can truly connect the audio and visual information in a meaningful way, like humans do.

What's the solution?

The researchers created a new benchmark called AV-SpeakerBench. It contains 3,212 multiple-choice questions about real-world videos, each designed to test whether an AI can figure out who is speaking, what they are saying, and when they are saying it. The questions are constructed so that they can only be answered by combining the audio and the video together, rather than from the visuals alone. The researchers then evaluated several AI models on this benchmark, including Google's Gemini family and the open-source model Qwen3-Omni-30B.
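To make the evaluation setup concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark like this one could be computed. The item fields, sample questions, and the always-answer-"A" dummy model below are illustrative assumptions, not AV-SpeakerBench's actual data schema or contents.

```python
# Hypothetical sketch of multiple-choice benchmark scoring.
# The MCQItem schema and example questions are assumptions for
# illustration only, not the benchmark's real format.
from dataclasses import dataclass, field


@dataclass
class MCQItem:
    question: str
    options: list = field(default_factory=list)  # e.g. ["A. ...", "B. ...", ...]
    answer: str = ""                             # gold option letter, e.g. "C"


def accuracy(items, predict):
    """Fraction of items where the predicted letter matches the gold answer."""
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Toy items in the spirit of speaker-centric questions.
items = [
    MCQItem("Who says 'let's begin' at 0:14?",
            ["A. The host", "B. The guest", "C. An off-screen voice"],
            "A"),
    MCQItem("When does the woman in red first speak?",
            ["A. 0:02", "B. 0:10", "C. 0:31"],
            "C"),
]

# A dummy "model" that always picks option A scores 50% here.
print(accuracy(items, lambda item: "A"))  # 0.5
```

In practice, the model's free-form response would be parsed down to an option letter before scoring, but the final metric reduces to the same exact-match accuracy shown above.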

Why it matters?

This work is important because it provides a much more difficult and realistic test for AI video understanding. It shows that while some models, like Gemini 2.5 Pro, are getting better at this kind of reasoning, many others still struggle, particularly with combining audio and visual information. This new benchmark will help researchers develop AI systems that can truly understand videos like humans do, which is crucial for applications like robotics, self-driving cars, and video analysis.

Abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.