Abstract: Model-learning agents should gather information to learn world models that
support many downstream tasks and inferences, such as predicting unobserved
states, estimating near- and far-term consequences of actions, planning action
sequences, and detecting changes in dynamics. Current methods for learning and
evaluating world models diverge from this goal: training and evaluation are
anchored to next-frame prediction, and success is scored by reward maximization
in the same environment. We propose WorldTest, a protocol to evaluate
model-learning agents that separates reward-free interaction from a scored test
phase in a different but related environment. WorldTest is open-ended (models
should support many different tasks unknown ahead of time) and agnostic to
model representation, allowing comparison across approaches. We instantiated
WorldTest with AutumnBench, a
suite of 43 interactive grid-world environments and 129 tasks across three
families: masked-frame prediction, planning, and predicting changes to the
causal dynamics. We compared 517 human participants and three frontier models
on AutumnBench. We found that humans outperform the models, and that scaling
compute improves performance in some environments but not others. WorldTest
provides a novel template (reward-free exploration, derived tests, and
behavior-based scoring) for evaluating what agents learn about environment
dynamics, and AutumnBench exposes significant headroom in world-model
learning.
Authors (11)
Archana Warrier
Dat Nguyen
Michelangelo Naim
Moksh Jain
Yichao Liang
Karen Schroeder
+5 more
Submitted
October 22, 2025
Key Contributions
This paper introduces WorldTest, a novel protocol for benchmarking world-model-learning agents. It separates reward-free interaction from a scored test phase in a different but related environment, enabling open-ended evaluation against diverse tasks unknown ahead of time and moving beyond next-frame prediction.
Business Value
Enables more reliable assessment of AI agents' understanding of their environment, crucial for developing robust autonomous systems that can adapt to new situations.