📄 Abstract
Building robots that can perceive, reason, and act in dynamic, unstructured
environments remains a core challenge. Recent embodied systems often adopt a
dual-system paradigm, where System 2 handles high-level reasoning while System
1 executes low-level control. In this work, we refer to System 2 as the
embodied brain, emphasizing its role as the cognitive core for reasoning and
decision-making in manipulation tasks. Given this role, systematic evaluation
of the embodied brain is essential. Yet existing benchmarks either emphasize
execution success or, when they target high-level reasoning, suffer from
incomplete dimensions and limited task realism, offering only a partial picture
of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark
that systematically evaluates multimodal large language models (MLLMs) as
embodied brains. Motivated by the critical roles across the full manipulation
pipeline, RoboBench defines five dimensions: instruction comprehension,
perception reasoning, generalized planning, affordance prediction, and failure
analysis. These dimensions span 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure
realism, we curate datasets across diverse embodiments, attribute-rich objects,
and multi-view scenes, drawing from large-scale real robotic data. For
planning, RoboBench introduces an evaluation framework,
MLLM-as-world-simulator, which evaluates embodied feasibility by simulating whether
predicted plans can achieve critical object-state changes. Experiments on 14
MLLMs reveal fundamental limitations: difficulties with implicit instruction
comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained
affordance understanding, and execution failure diagnosis. RoboBench provides a
comprehensive scaffold to quantify high-level cognition and to guide the
development of next-generation embodied MLLMs. The project page is at
https://robo-bench.github.io.
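The MLLM-as-world-simulator idea can be pictured as rolling a predicted plan forward over a symbolic world state and checking whether the critical object-state changes occur. Below is a minimal, hypothetical Python sketch of that loop; the names `World`, `simulate_step`, and `plan_is_feasible` are our own illustrations, the MLLM judge call is stubbed out with a toy transition, and none of this reflects the paper's actual prompts or implementation.

```python
# Hypothetical sketch of an "MLLM-as-world-simulator" feasibility check.
# All names and transitions here are illustrative assumptions, not the
# paper's implementation.

from dataclasses import dataclass, field


@dataclass
class World:
    """Symbolic world state: object name -> set of state predicates."""
    objects: dict[str, set[str]] = field(default_factory=dict)


def simulate_step(world: World, step: str) -> World:
    """Stub for the MLLM judge call that predicts the post-step state.

    The real framework would prompt an MLLM with the current state and
    the plan step, then parse the predicted object-state changes. Here
    we hard-code one toy transition for illustration.
    """
    if step == "open the drawer":
        world.objects.setdefault("drawer", set()).discard("closed")
        world.objects.setdefault("drawer", set()).add("open")
    return world


def plan_is_feasible(initial: World, plan: list[str],
                     goal: dict[str, set[str]]) -> bool:
    """Roll the plan forward, then check the critical object-state changes."""
    world = initial
    for step in plan:
        world = simulate_step(world, step)
    return all(preds <= world.objects.get(obj, set())
               for obj, preds in goal.items())


# Toy usage: a one-step plan that must leave the drawer open.
start = World({"drawer": {"closed"}})
print(plan_is_feasible(start, ["open the drawer"],
                       goal={"drawer": {"open"}}))  # True
```

The design choice this sketch highlights is that feasibility is judged on object-state outcomes rather than on surface similarity to a reference plan, so differently worded but causally equivalent plans can score the same.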
Authors (21)
Yulin Luo
Chun-Kai Fan
Menghang Dong
Jiayu Shi
Mengdi Zhao
Bo-Wen Zhang
+15 more
Submitted
October 20, 2025
Key Contributions
RoboBench is a comprehensive benchmark designed to systematically evaluate multimodal large language models (MLLMs) as the 'embodied brain' of robots. It addresses the limitations of existing benchmarks by covering five dimensions across the full manipulation pipeline (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis) and by emphasizing task realism, providing a more holistic assessment of cognitive capability.
Business Value
Accelerates the development of more capable and reliable robots by providing a standardized, comprehensive evaluation framework for their AI 'brains', which is crucial for widespread adoption in industry and homes.