SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
2025-05-23
Summary
This paper introduces SAKURA, a new benchmark for testing how well large models that understand both audio and language can solve problems that require reasoning through several steps using information from speech and other sounds.
What's the problem?
The problem is that these models often struggle to combine different pieces of information from audio, such as spoken words and background sounds, to answer questions that need more than one step of reasoning. Even when a model can pick out a single fact from a clip, it may fail to link that fact with other knowledge to reach the answer.
What's the solution?
The researchers created SAKURA, a benchmark designed to measure how well these models handle multi-hop reasoning, where the model must connect several clues from the audio to work out the answer. The results show that current models still struggle with this kind of layered thinking when the information comes from audio.
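To make the idea of multi-hop reasoning concrete, here is a minimal Python sketch of what one test item could look like: a clip, a single-hop question about one attribute in the audio, and a multi-hop question that combines that attribute with outside knowledge. The class name, fields, example questions, and scoring are illustrative assumptions, not SAKURA's actual data format.

```python
from dataclasses import dataclass

@dataclass
class MultiHopAudioItem:
    audio_path: str           # clip containing speech or other sounds
    single_hop_question: str  # asks about one attribute heard in the clip
    multi_hop_question: str   # combines that attribute with world knowledge
    answer: str

# Hypothetical item: the model must first identify the sound source,
# then reason about a property of that source.
item = MultiHopAudioItem(
    audio_path="clip_0001.wav",  # hypothetical file name
    single_hop_question="What animal is making the sound in this clip?",
    multi_hop_question="Is the animal heard in this clip usually kept as a house pet?",
    answer="yes",
)

def is_correct(model_answer: str, item: MultiHopAudioItem) -> bool:
    # Toy exact-match scoring; real benchmarks often use multiple-choice
    # matching or an LLM judge instead.
    return model_answer.strip().lower() == item.answer.lower()

print(is_correct("Yes", item))  # True
```

Pairing the two questions shows why this kind of benchmark is revealing: a model can answer the single-hop question correctly yet still fail the multi-hop one, which is exactly the gap between perceiving audio and reasoning over it.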
Why does it matter?
This is important because it highlights what needs to improve in audio-language models before they can reliably understand and reason about sound. Better multi-hop reasoning would benefit applications like virtual assistants, accessibility tools, and the analysis of audio recordings.
Abstract
SAKURA is introduced to evaluate the multi-hop reasoning abilities of large audio-language models, revealing their struggles in integrating speech and audio representations.