AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
2024-12-04
Summary
This paper introduces the AV-Odyssey Bench, a new benchmark designed to test whether multimodal large language models (MLLMs) can effectively understand and process audio-visual information.
What's the problem?
Although multimodal LLMs like GPT-4o and Gemini 1.5 Pro can handle a wide range of tasks involving both audio and visual data, the paper's DeafTest diagnostic shows that they often stumble on tasks humans find trivial, such as judging which of two sounds is louder or which has a higher pitch. This suggests that, despite their sophistication, these models may not truly understand audio-visual information as well as their broader performance implies.
What's the solution?
To investigate this issue, the researchers created the AV-Odyssey Bench: 4,555 carefully designed questions, each combining text, images, and audio, that can only be answered by drawing on clues from all of these modalities together. The questions are structured as multiple-choice so responses can be scored objectively, with no need for human judges or LLM-assisted grading, making the benchmark a direct test of how well models integrate information across sources.
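Because every question is multiple-choice, scoring reduces to exact-match accuracy over predicted option letters. The sketch below illustrates the idea; the record fields (`answer`, `prediction`) and the letter-extraction heuristic are assumptions for demonstration, not the paper's actual evaluation code.

```python
# Minimal sketch of multiple-choice scoring, the style of objective
# evaluation AV-Odyssey Bench uses. Field names are illustrative
# assumptions, not the paper's actual data schema.
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a model response."""
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    return match.group(1) if match else None

def score(examples: list[dict]) -> float:
    """Exact-match accuracy: no human judges or LLM graders needed."""
    correct = sum(
        extract_choice(ex["prediction"]) == ex["answer"]  # e.g. "B"
        for ex in examples
    )
    return correct / len(examples)

if __name__ == "__main__":
    demo = [
        {"answer": "B", "prediction": "The answer is B."},
        {"answer": "C", "prediction": "A"},
    ]
    print(f"accuracy = {score(demo):.2f}")  # accuracy = 0.50
```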
Why it matters?
This research matters because it exposes the limitations of current multimodal LLMs on complex audio-visual tasks. By providing a comprehensive, objective evaluation framework, AV-Odyssey Bench can guide future dataset collection and model development, ultimately leading to AI systems that understand and interact with the world in a more human-like way.
Abstract
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.
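To make the two DeafTest tasks concrete, here is a minimal sketch of how such stimulus pairs could be synthesized. The specific parameters (440 Hz and 880 Hz tones, a roughly 6 dB level gap, one-second duration) are illustrative assumptions, not the paper's actual test construction.

```python
# Illustrative synthesis of DeafTest-style stimulus pairs (loudness and
# pitch comparisons). All frequencies, amplitudes, and durations are
# assumptions for demonstration; the paper's actual stimuli may differ.
import numpy as np

SAMPLE_RATE = 16_000  # Hz

def tone(freq_hz: float, amplitude: float, duration_s: float = 1.0) -> np.ndarray:
    """A pure sine tone; amplitude in [0, 1] relative to full scale."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return amplitude * np.sin(2.0 * np.pi * freq_hz * t)

# Task 1: which sound is louder? Same pitch, ~6 dB level difference.
loud = tone(440.0, amplitude=0.8)
quiet = tone(440.0, amplitude=0.4)

# Task 2: which sound has a higher pitch? Same level, one octave apart.
low = tone(440.0, amplitude=0.5)
high = tone(880.0, amplitude=0.5)

# A human answers both comparisons instantly; DeafTest probes whether an
# MLLM given the audio can do the same.
```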