📄 Abstract
Building robots that can perceive, reason, and act in dynamic, unstructured
environments remains a core challenge. Recent embodied systems often adopt a
dual-system paradigm, where System 2 handles high-level reasoning while System
1 executes low-level control. In this work, we refer to System 2 as the
embodied brain, emphasizing its role as the cognitive core for reasoning and
decision-making in manipulation tasks. Given this role, systematic evaluation
of the embodied brain is essential. Yet existing benchmarks either emphasize
execution success or, when they target high-level reasoning, suffer from
incomplete dimensions and limited task realism, offering only a partial picture
of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark
that systematically evaluates multimodal large language models (MLLMs) as
embodied brains. Motivated by the critical roles across the full manipulation
pipeline, RoboBench defines five dimensions: instruction comprehension,
perception reasoning, generalized planning, affordance prediction, and failure
analysis. These dimensions span 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure
realism, we curate datasets across diverse embodiments, attribute-rich objects,
and multi-view scenes, drawing from large-scale real robotic data. For
planning, RoboBench introduces an evaluation framework,
MLLM-as-world-simulator, which evaluates embodied feasibility by simulating whether
predicted plans can achieve critical object-state changes. Experiments on 14
MLLMs reveal fundamental limitations: difficulties with implicit instruction
comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained
affordance understanding, and execution failure diagnosis. RoboBench provides a
comprehensive scaffold to quantify high-level cognition and to guide the
development of next-generation embodied MLLMs. The project page is at
https://robo-bench.github.io.
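The MLLM-as-world-simulator idea can be pictured as rolling a predicted plan forward over a symbolic world state and checking whether the critical object-state changes occur. Below is a minimal, hypothetical Python sketch of that loop; the names `World`, `simulate_step`, and `plan_is_feasible` are our own illustrations, the MLLM judge call is stubbed out with a toy transition, and none of this reflects the paper's actual prompts or implementation.

```python
# Hypothetical sketch of an "MLLM-as-world-simulator" feasibility check.
# All names and transitions here are illustrative assumptions, not the
# paper's implementation.

from dataclasses import dataclass, field


@dataclass
class World:
    """Symbolic world state: object name -> set of state predicates."""
    objects: dict[str, set[str]] = field(default_factory=dict)


def simulate_step(world: World, step: str) -> World:
    """Stub for the MLLM judge call that predicts the post-step state.

    The real framework would prompt an MLLM with the current state and
    the plan step, then parse the predicted object-state changes. Here
    we hard-code one toy transition for illustration.
    """
    if step == "open the drawer":
        world.objects.setdefault("drawer", set()).discard("closed")
        world.objects.setdefault("drawer", set()).add("open")
    return world


def plan_is_feasible(initial: World, plan: list[str],
                     goal: dict[str, set[str]]) -> bool:
    """Roll the plan forward, then check the critical object-state changes."""
    world = initial
    for step in plan:
        world = simulate_step(world, step)
    return all(preds <= world.objects.get(obj, set())
               for obj, preds in goal.items())


# Toy usage: a one-step plan that must leave the drawer open.
start = World({"drawer": {"closed"}})
print(plan_is_feasible(start, ["open the drawer"],
                       goal={"drawer": {"open"}}))  # True
```

The design choice this sketch highlights is that feasibility is judged on object-state outcomes rather than on surface similarity to a reference plan, so differently worded but causally equivalent plans can score the same.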
Authors (21)
Yulin Luo
Chun-Kai Fan
Menghang Dong
Jiayu Shi
Mengdi Zhao
Bo-Wen Zhang
+15 more
Submitted
October 20, 2025
Key Contributions
RoboBench is a comprehensive benchmark designed to systematically evaluate multimodal large language models (MLLMs) as the 'embodied brain' of robots. It addresses the limitations of existing benchmarks by covering five dimensions across the full manipulation pipeline (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis) and by emphasizing task realism, providing a more holistic assessment of cognitive capability.
Business Value
Accelerates the development of more capable and reliable robots by providing a standardized, comprehensive evaluation framework for their AI 'brains', which is crucial for widespread adoption in industry and homes.