Research Paper. Audience: AI researchers, ML engineers, developers of multimodal systems, benchmark creators

MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels

📄 Abstract

Multimodal Large Language Models have been progressing from uni-modal understanding toward unifying visual, audio, and language modalities; such unified models are collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, and comprehensive evaluation is needed to drive the evolution of omni models' intelligence. In this work, we propose a novel, high-quality, and diverse omni-model benchmark, the MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1880 human-curated samples across 44 task types, and an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show the compositional law between cross-modal and uni-modal performance, and that omni-modal capability manifests as a bottleneck effect on weak models while exhibiting synergistic promotion on strong models.
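The abstract does not state the functional form of the compositional law, so the following is only a minimal sketch of how one might probe such a relationship on per-model scores. It assumes two hypothetical compositions of uni-modal scores (an average and a product) and checks which one correlates more strongly with the omni-modal score. The function names, the two candidate forms, and the score values are all illustrative assumptions; the numbers are synthetic placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def compare_compositions(visual, audio, omni):
    """Correlate omni scores with two hypothetical compositions of uni-modal scores.

    Both candidate forms are assumptions made for illustration only.
    """
    additive = [(v + a) / 2 for v, a in zip(visual, audio)]
    multiplicative = [v * a for v, a in zip(visual, audio)]
    return {
        "additive": pearson(additive, omni),
        "multiplicative": pearson(multiplicative, omni),
    }


# Synthetic placeholder scores in [0, 1], one entry per hypothetical model.
# These are NOT measurements from MMAO-Bench.
visual = [0.30, 0.45, 0.60, 0.75, 0.90]
audio = [0.80, 0.40, 0.70, 0.50, 0.85]
# Constructed to follow the product form exactly, so the check is verifiable.
omni = [v * a for v, a in zip(visual, audio)]

result = compare_compositions(visual, audio, omni)
```

Under a product-like composition, a weak score in either modality caps the combined score, which is one way a bottleneck effect on weak models could show up. With real benchmark scores, the same comparison would indicate only which candidate form tracks the data better, not that either form is the law reported by the authors.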

Authors (9)

Chen Chen

ZeYang Hu

Fengjiao Chen

Liya Ma

Jiaxing Liu

Xiaoyu Li

+3 more

Submitted

October 21, 2025

arXiv Category

cs.CL

Key Contributions

This paper introduces MMAO-Bench, a novel, high-quality benchmark for evaluating Multimodal Large Language Models (Omni Models). It assesses both uni-modal and omni-modal understanding capabilities and reveals the compositional law between them, highlighting how uni-modal performance influences omni-modal capabilities and identifying bottleneck effects.

Business Value

Provides a standardized way to measure and compare the progress of multimodal AI, accelerating development and identifying areas for improvement in creating more capable and versatile AI systems.

Paper Metadata

Innovation Type

Benchmark/Dataset

Deployment Feasibility

The benchmark is a research artifact and is not itself deployable, but it guides the development of deployable multimodal models.

Limitations Addressed

Lack of comprehensive evaluation benchmarks for multimodal LLMs and the unclear relationship between their uni-modal and omni-modal performance.

Technical Tags

Multimodal LLMs, Omni Models, Benchmark, Uni-modal Understanding, Omni-modal Understanding, Compositional Law, Reasoning, Visual Understanding, Audio Understanding, Language Understanding

Research Topics

Multimodal AI, Large Language Models, Artificial Intelligence Evaluation, Computer Vision, Speech Processing

Methods & Architectures

Benchmark creation, Multi-step open-ended question answering, Human curation, Multimodal Large Language Models (Omni Models)

Applications & Tasks

Artificial Intelligence, Human-Computer Interaction, Robotics, Content Analysis

Unclear correlation between uni-modal and omni-modal capabilities, Lack of comprehensive benchmarks for omni models, Assessing complex reasoning in multimodal settings

Evaluating uni-modal understanding, Evaluating omni-modal understanding, Assessing compositional reasoning, Driving intelligence evolution of omni models

Datasets & Benchmarks

Datasets

MMAO-Bench

Benchmarks

MMAO-Bench (1880 samples, 44 task types)

Related Fields

Computer Vision, Natural Language Processing, Speech Processing, Artificial Intelligence, Machine Learning

Keywords

Multimodal LLMs, Omni Models, Benchmark, Evaluation, Uni-modal, Omni-modal, Compositional Law, Reasoning, Visual AI, Audio AI, Language AI, AI Development

Academic Context

#Multimodal AI, #Large Language Models, #Artificial Intelligence Evaluation, #Computer Vision, #Speech Processing

Commercial Potential

Potential Products

More capable multimodal assistants, Advanced AI for content creation and analysis, Robots with enhanced environmental understanding

Target Industries

Technology, Media, Robotics, Automotive, Healthcare

Use Case Examples

An AI that can describe an image, answer questions about its content, and understand accompanying audio.

Developing AI for autonomous vehicles that process visual, auditory, and textual information simultaneously.

Creating AI tools for analyzing complex multimedia documents.

Competitive Edge

Establishes a new standard for evaluating multimodal LLMs, enabling more rigorous comparison and driving progress in the field.

Market Opportunity

Rapid growth in the multimodal AI market.

Revenue Models

N/A (Benchmark)

Resource Requirements

Compute Needs

N/A (Benchmark)

Data Requirements

N/A (Benchmark)

Deployment Constraints

N/A (Benchmark)

Scalability

N/A (Benchmark)

Regulatory Considerations

Data privacy and ethical considerations for multimodal data.

Production Readiness

Maturity Level

Research/Development

Time to Market

N/A (Benchmark)

Patent Potential

Low (Benchmark creation)
