📄 Abstract
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to
measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six
attributes under absolute and relative regimes) with a Holistic Spatio-Temporal
Reasoning setting that includes segment reordering for continuous and discrete
processes, together with spatial tasks spanning static localization, multi-source
relations, and dynamic trajectories. Our data curation pipeline uses two
methods to ensure high-quality samples. For foundational tasks, we use
procedurally synthesized and physics-simulated audio. For holistic data, we
follow a four-stage process that includes human annotation and final selection
based on human performance. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on cues that are hard to describe in language. Evaluating 19 models reveals substantial gaps compared
hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared
with humans and a capability hierarchy: closed-source models are bottlenecked
by fine-grained perception, while open-source models lag across perception,
knowledge, and reasoning. STAR-Bench provides critical insights and a clear
path forward for developing future models with a more robust understanding of
the physical world.
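To make the foundational setting concrete, here is a minimal sketch of how a procedurally synthesized item for the relative pitch regime could be built: two tones separated by silence, with the question asking which tone is higher. The sample rate, durations, frequency ratio, and item format are illustrative assumptions, not the paper's actual synthesis pipeline.

```python
import numpy as np

SR = 16_000  # sample rate in Hz; an assumption, not specified in the abstract

def sine_tone(freq_hz: float, dur_s: float, sr: int = SR) -> np.ndarray:
    """Pure tone with a 10 ms fade-in/out to avoid clicks at the edges."""
    t = np.arange(int(dur_s * sr)) / sr
    fade = np.minimum(1.0, np.minimum(t, t[::-1]) / 0.01)
    return np.sin(2 * np.pi * freq_hz * t) * fade

def relative_pitch_item(base_hz: float = 440.0, ratio: float = 1.06,
                        rng: np.random.Generator | None = None):
    """One hypothetical 'relative regime' item: two tones, which is higher?"""
    rng = rng or np.random.default_rng()
    higher_first = bool(rng.integers(2))  # randomize the correct answer
    freqs = (base_hz * ratio, base_hz) if higher_first else (base_hz, base_hz * ratio)
    gap = np.zeros(int(0.25 * SR))  # 250 ms of silence between the tones
    audio = np.concatenate([sine_tone(freqs[0], 0.5), gap, sine_tone(freqs[1], 0.5)])
    return audio, {"question": "Which tone is higher in pitch?",
                   "answer": "first" if higher_first else "second"}

audio, item = relative_pitch_item()
print(item["answer"], audio.shape)
```

Because every parameter of the waveform is controlled, the ground truth is exact by construction, which is the appeal of procedural synthesis for perception tests.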
Authors (13)
Zihan Liu
Zhikang Niu
Qiuyang Xiao
Zhisheng Zheng
Ruoqi Yuan
Yuhang Zang
+7 more
Submitted
October 28, 2025
Key Contributions
This paper formalizes 'audio 4D intelligence', reasoning over sound dynamics in time and 3D space, and introduces STAR-Bench, a benchmark designed to measure it. STAR-Bench combines foundational acoustic perception tasks with holistic spatio-temporal reasoning tasks (segment reordering, static localization, multi-source relations, dynamic trajectories), built from procedurally synthesized, physics-simulated, and human-annotated data. The benchmark probes perceptual reasoning in audio models beyond semantics that can be recovered from text captions alone.
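For a concrete picture of the segment-reordering task, the sketch below cuts a clip into equal pieces, shuffles them, and records the permutation that restores the original order as the answer key. The segment count and item format are hypothetical, chosen only to illustrate the task structure.

```python
import numpy as np

def reordering_item(audio: np.ndarray, n_segments: int = 4,
                    rng: np.random.Generator | None = None):
    """Hypothetical segment-reordering item: shuffle equal-length pieces
    of a clip and keep the order that undoes the shuffle as the label."""
    rng = rng or np.random.default_rng()
    pieces = np.array_split(audio, n_segments)
    order = rng.permutation(n_segments)              # presentation order
    shuffled = np.concatenate([pieces[i] for i in order])
    answer = np.argsort(order).tolist()              # indices that restore it
    return shuffled, {"question": "Restore the original segment order.",
                      "answer": answer}

clip = np.random.default_rng(0).standard_normal(16_000)  # stand-in 1 s clip
shuffled, item = reordering_item(clip)
print(item["answer"])
```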
Business Value
Provides a standardized and rigorous way to evaluate and improve AI's ability to understand complex sound environments, crucial for applications like advanced robotics, surveillance, and immersive audio experiences.