arxiv_ai 93% Match Research Paper AI Researchers,ML Engineers,Audio AI Developers,Music Technologists 2 weeks ago

The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMS

speech-audio › music-ai

📄 Abstract

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio-Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.

Authors (3)

Brandon James Carone

Iran R. Roman

Pablo Ripollés

Submitted

October 21, 2025

arXiv Category

cs.AI

arXiv PDF

Key Contributions

Introduces the MUSE Benchmark, a novel resource with 10 tasks to rigorously evaluate music perception and auditory relational reasoning in audio LLMs. It reveals significant performance gaps between state-of-the-art models and human experts, highlighting persistent perceptual deficits and inconsistent benefits of Chain-of-Thought prompting.

Business Value

Enables developers to better understand and improve the music and audio understanding capabilities of AI systems, leading to more sophisticated music generation, analysis, and interactive audio experiences.

Paper Metadata

Innovation Type

Benchmark/Dataset

Deployment Feasibility

High for the benchmark itself. Deployment of models that pass this benchmark would be feasible.

Limitations Addressed

Current evaluations of audio LLMs may obscure fundamental weaknesses in relational reasoning. MUSE provides a targeted evaluation for these specific capabilities.

Performance Gains

The paper reports performance gaps, not gains, showing current models are significantly below human expert levels on specific music perception and reasoning tasks.

Technical Tags

multimodal LLMsaudio understandingrelational reasoningbenchmarkmusic perceptionauditory reasoningstate-of-the-art modelshuman baselineChain-of-Thoughtinvariant representations

Research Topics

Multimodal AI EvaluationAudio UnderstandingAI Reasoning CapabilitiesMusic Information RetrievalBenchmark Development

Methods & Architectures

MUSE BenchmarkHuman evaluationChain-of-Thought prompting evaluation Multimodal Large Language Models (MLLMs)Gemini ProGemini FlashQwen2.5-OmniAudio-Flamingo 3

Applications & Tasks

Music AI Audio Analysis AI Evaluation Evaluating relational reasoning in audio LLMsAssessing music perception capabilitiesIdentifying weaknesses in multimodal models Probing music perceptionEvaluating auditory relational reasoningBenchmarking audio LLMs

Datasets & Benchmarks

Datasets

MUSE Benchmark

Benchmarks

MUSE Benchmark (10 tasks)

Performance varianceGap with human expertsChance performance

Related Fields

Artificial IntelligenceMachine LearningMusicologyCognitive ScienceNatural Language Processing

Keywords

MUSE benchmarkaudio LLMsmultimodal AImusic perceptionauditory reasoningrelational reasoningevaluationbenchmarkGeminiQwenAudio-FlamingoChain-of-Thought

Academic Context

#Multimodal AI Evaluation#Audio Understanding#AI Reasoning Capabilities#Music Information Retrieval#Benchmark Development

Commercial Potential

Potential Products

AI music composition toolsAdvanced audio analysis softwareMusic recommendation systemsInteractive music learning platforms

Target Industries

Music IndustryEntertainmentTechnologyAI Development

Use Case Examples

Evaluating AI's ability to understand musical structureDeveloping AI that can generate novel musical piecesAssessing AI's comprehension of sound relationships

Competitive Edge

Provides a specialized benchmark to differentiate and improve audio LLMs, particularly in complex reasoning tasks beyond basic audio recognition.

Market Opportunity

Growing market for AI in music creation and analysis.

Revenue Models

Indirectly through improved AI tools and services.

Resource Requirements

Compute Needs

Moderate for running evaluations on the benchmark. Training the evaluated models requires significant compute.

Data Requirements

The MUSE Benchmark dataset.

Deployment Constraints

Models need to be robust to variations in audio quality and musical styles.

Scalability

The benchmark itself is scalable as more tasks and data can be added. Model scalability depends on the underlying LLM architecture.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years for models to significantly improve based on benchmark feedback.

Licensing

Open-source benchmark.

Patent Potential

Low for the benchmark itself, potentially moderate for novel model architectures developed based on insights.

View Full Paper Back to Papers