
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

Abstract

We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.
Authors (7)
Callum Sharrock
Lukas Petersson
Hanna Petersson
Axel Backlund
Axel Wennström
Kristoffer Nordström
+1 more
Submitted
October 23, 2025
arXiv Category
cs.RO

Key Contributions

Introduces Butter-Bench, a benchmark that evaluates the 'practical intelligence' of LLM-controlled robots, defined as the ability to navigate the messiness of the physical world. The benchmark shows that although LLMs excel at analytical tasks, humans still outperform them by a wide margin on Butter-Bench (a mean human score of 95% versus 40% for the best LLMs), with LLMs struggling most in multi-step spatial planning and social understanding.

Business Value

Gives developers and researchers a way to measure the real-world capabilities of LLM-powered robots and to see where they fall short, which can inform the development of more capable and reliable autonomous systems.