STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang

2025-10-29

Summary

This paper introduces a new way to test how well AI understands audio, going beyond just recognizing what sounds *are* and focusing on how sounds change over time and space.

What's the problem?

Current audio benchmarks mostly check whether AI can grasp the general meaning of a sound, which can usually be recovered from a text caption alone. That doesn't test whether the AI picks up on subtle details of how sounds evolve: where a sound is coming from, how it's moving through space, or fine-grained changes in the sound itself. In short, existing tests don't challenge AI to truly 'listen' and reason about the physical world from sound.

What's the solution?

The researchers created a new benchmark called STAR-Bench. It tests two main things: foundational acoustic perception (six basic sound attributes such as loudness and pitch, judged both on their own and relative to one another) and holistic spatio-temporal reasoning, such as putting shuffled audio segments back in the right order and tracking where sounds are and how they move in 3D space. To keep quality high, the foundational tasks use procedurally synthesized and physics-simulated audio, while the holistic tasks go through a multi-stage pipeline with human annotation and a final check against human performance. The researchers then evaluated 19 AI models on the benchmark.
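To make the foundational-perception idea concrete, here is a minimal sketch of how one procedurally synthesized test item might be generated. This is not the authors' actual pipeline: the function names, frequency ranges, and question wording are illustrative assumptions, and only NumPy is assumed.

```python
# Hypothetical sketch of a procedurally synthesized "relative pitch" item,
# in the spirit of STAR-Bench's foundational perception tasks (not the
# authors' code; names and parameter ranges are assumptions).
import numpy as np

SAMPLE_RATE = 16_000  # samples per second

def make_tone(freq_hz: float, duration_s: float = 1.0, amplitude: float = 0.3) -> np.ndarray:
    """Synthesize a pure sine tone as a stand-in for simulated audio."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return amplitude * np.sin(2.0 * np.pi * freq_hz * t)

def build_relative_pitch_item(rng: np.random.Generator) -> dict:
    """Create one two-interval question: which tone is higher in pitch?"""
    f_low = rng.uniform(200.0, 800.0)
    f_high = f_low * rng.uniform(1.1, 1.5)    # audibly higher, but not trivially so
    freqs = rng.permutation([f_low, f_high])  # randomize presentation order
    gap = np.zeros(SAMPLE_RATE // 2)          # half a second of silence between tones
    audio = np.concatenate([make_tone(freqs[0]), gap, make_tone(freqs[1])])
    return {
        "audio": audio,
        "question": "Which tone is higher in pitch: the first or the second?",
        "answer": "first" if freqs[0] > freqs[1] else "second",
    }

item = build_relative_pitch_item(np.random.default_rng(0))
print(item["question"], "->", item["answer"])
```

The key property this mirrors is that the correct answer depends on a fine-grained acoustic comparison: a text caption like "two tones play" would give a model nothing to go on.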

Why it matters?

This work shows that current AI models, even advanced ones, still struggle with understanding the nuances of sound. It highlights a clear area for improvement in AI development – building models that can better perceive and reason about the physical world through sound, which is crucial for things like robotics, self-driving cars, and assistive technologies.

Abstract

Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
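The caption-only ablation described in the abstract can be sketched in a few lines: score each item twice, once given the audio and once given only a text caption in its place, and report the accuracy gap. This is an assumed formulation rather than the paper's released evaluation code; `answer_with_audio` and `answer_with_caption` are hypothetical stand-ins for a model's inference call under each condition.

```python
# Hedged sketch of the caption-only accuracy-drop measurement (assumed
# formulation; the answer functions stand in for any model's inference).
from typing import Callable, Dict, List

def accuracy(items: List[Dict], answer_fn: Callable[[Dict], str]) -> float:
    """Fraction of items the model answers correctly."""
    return sum(answer_fn(it) == it["answer"] for it in items) / len(items)

def caption_only_drop(items: List[Dict],
                      answer_with_audio: Callable[[Dict], str],
                      answer_with_caption: Callable[[Dict], str]) -> float:
    """Accuracy lost when the audio is replaced by a text caption.

    A large positive drop (STAR-Bench reports 31.5 points on temporal
    tasks and 35.2 on spatial ones) indicates the benchmark probes cues
    that are hard to put into words, so the model must actually listen.
    """
    return accuracy(items, answer_with_audio) - accuracy(items, answer_with_caption)
```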