Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

📄 Abstract

We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.
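The hierarchical split described in the abstract can be illustrated with a minimal Python sketch. Nothing here comes from the paper: the names `Subgoal`, `run_task`, `stub_planner`, and `reliable_controller` are hypothetical, the planner is a fixed stand-in for an LLM call, and the controller is a stand-in for a VLA. The sketch only shows one way a high-level planner could be scored in isolation, by pairing it with a low-level layer that is assumed never to fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subgoal:
    """A high-level instruction handed from the planner to the controller."""
    description: str


# Hypothetical interfaces (assumptions, not the paper's API):
# a planner maps a task description to an ordered list of subgoals,
# a controller attempts one subgoal and reports success or failure.
Planner = Callable[[str], List[Subgoal]]
Controller = Callable[[Subgoal], bool]


def run_task(task: str, planner: Planner, controller: Controller) -> bool:
    """Plan at the high level, then execute each subgoal at the low level."""
    plan = planner(task)
    # An empty plan counts as a failure; otherwise every subgoal must succeed.
    return bool(plan) and all(controller(subgoal) for subgoal in plan)


def stub_planner(task: str) -> List[Subgoal]:
    # Stand-in for an LLM call; returns a fixed illustrative plan.
    return [Subgoal("go to the pickup location"),
            Subgoal("confirm the item is present"),
            Subgoal("return to the requester")]


def reliable_controller(subgoal: Subgoal) -> bool:
    # Stand-in for low-level control that always succeeds, so any task
    # failure is attributable to the planner alone.
    return True


succeeded = run_task("deliver an item", stub_planner, reliable_controller)
```

Swapping `stub_planner` for a real model call while keeping the controller fixed is what makes the planner the only variable under test in this sketch.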

Authors (7)

Callum Sharrock

Lukas Petersson

Hanna Petersson

Axel Backlund

Axel Wennström

Kristoffer Nordström

+1 more

Submitted

October 23, 2025

arXiv Category

cs.RO

Key Contributions

Introduces Butter-Bench, a novel benchmark designed to evaluate the 'practical intelligence' of LLM-controlled robots in navigating the complexities of the physical world. The benchmark reveals that while LLMs excel in analytical tasks, humans still significantly outperform them in embodied tasks, particularly in multi-step spatial planning and social understanding.

Business Value

Provides a crucial tool for developers and researchers to accurately assess and improve the real-world capabilities of robots powered by LLMs, accelerating the development of more capable and reliable autonomous systems.

Paper Metadata

Innovation Type

Benchmark Development and Evaluation

Deployment Feasibility

High for the benchmark itself. The findings suggest current LLMs need significant improvement for robust robotic deployment in complex environments.

Limitations Addressed

Lack of standardized benchmarks for evaluating the practical, embodied intelligence of LLM-controlled robots in real-world scenarios.

Performance Gains

Humans achieve a mean score of 95% on Butter-Bench, while the best LLMs score 40%. Fine-tuning for embodied reasoning did not improve LLM scores.
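To make the size of the gap concrete, the two reported scores can be compared directly. This is plain arithmetic on the figures above; the variable names are illustrative only.

```python
# Scores as reported above, in percent.
human_mean_pct = 95
best_llm_pct = 40

# Absolute gap in percentage points.
gap_points = human_mean_pct - best_llm_pct      # 55

# How many times higher the mean human score is than the best LLM score.
ratio = human_mean_pct / best_llm_pct           # 2.375

# Best LLM score expressed as a share of the mean human score.
llm_share_of_human = round(best_llm_pct / human_mean_pct, 3)

print(gap_points, ratio, llm_share_of_human)
```

The best LLMs trail the mean human score by 55 percentage points, which puts them at well under half of human performance on this benchmark.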

Technical Tags

LLM controlled robots, Practical intelligence, Embodied reasoning, Robotic control, Vision Language Action (VLA), Hierarchical architecture, Physical world interaction, Spatial planning, Social understanding, Benchmark evaluation

Research Topics

Robotic Intelligence, Embodied AI, LLM Applications in Robotics, Robot Evaluation, Human-Robot Interaction

Methods & Architectures

Benchmark creation, Comparative evaluation (LLMs vs. Humans), Isolation of LLM component in robotic control, Hierarchical robotic control architecture, LLM-based high-level reasoning, Vision Language Action (VLA) models

Applications & Tasks

Robotics, Autonomous Systems, Human-Robot Interaction, Evaluating practical intelligence of LLM-controlled robots, Bridging the gap between analytical and physical world intelligence, Assessing LLM capabilities in complex physical tasks, Robot navigation, Task execution in physical environments, Interacting with the physical world, Multi-step spatial planning, Social understanding for robots

Datasets & Benchmarks

Benchmarks

Butter-Bench

Score on Butter-Bench

Related Fields

Robotics, Artificial Intelligence, Large Language Models, Embodied AI, Human-Robot Interaction, Benchmarking

Keywords

Robotics, LLMs, Embodied AI, Benchmark, Practical Intelligence, Autonomous Systems, Robot Control, Spatial Planning, Social Understanding, Evaluation, Vision Language Action

Academic Context

#Robotic Intelligence, #Embodied AI, #LLM Applications in Robotics, #Robot Evaluation, #Human-Robot Interaction

Companies & Organizations

Companies Mentioned

OpenAI, Anthropic, Google

Commercial Potential

Potential Products

More capable autonomous robots, Advanced robotic assistants, Safer autonomous vehicles

Target Industries

Robotics, Automotive, Logistics, Manufacturing, Consumer Electronics

Use Case Examples

Robots performing complex assembly tasks in a factory. Autonomous delivery robots navigating cluttered urban environments. Personal assistant robots interacting naturally with humans.

Competitive Edge

Highlights the current gap between LLM capabilities in analytical reasoning and their performance in embodied, practical tasks, setting a new standard for evaluation.

Market Opportunity

Rapid growth in the robotics and AI markets, with increasing interest in LLM integration.

Revenue Models

Consulting, specialized LLM development for robotics.

Resource Requirements

Compute Needs

Not directly applicable to the benchmark itself, but significant for training/running LLMs for robotics.

Data Requirements

The benchmark defines its own task environment and requirements.

Deployment Constraints

Current LLMs struggle with the nuances of the physical world, limiting their direct deployment for complex robotic tasks without significant safety and control layers.

Scalability

The benchmark is designed to be scalable to different robotic platforms and tasks.

Production Readiness

Maturity Level

Benchmark/Evaluation Framework

Time to Market

Immediate for benchmark usage; longer for LLMs to reach human-level practical intelligence.

Licensing

Likely open-source for the benchmark.