
The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio Flamingo 3) against a large human baseline (N=200). Our results reveal wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen2.5-Omni and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find that Chain-of-Thought (CoT) prompting yields inconsistent and often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and for driving the development of more robust AI systems.
Authors: Brandon James Carone, Iran R. Roman, Pablo RipollΓ©s
Submitted: October 21, 2025
arXiv category: cs.AI

Key Contributions

Introduces the MUSE Benchmark, a novel resource with 10 tasks for rigorously evaluating music perception and auditory relational reasoning in audio LLMs. The benchmark reveals significant performance gaps between state-of-the-art models and human experts, highlighting persistent perceptual deficits and the inconsistent benefits of Chain-of-Thought prompting.

Business Value

Enables developers to better understand and improve the music and audio understanding capabilities of AI systems, supporting more sophisticated music generation, analysis, and interactive audio experiences.