Abstract: Model-learning agents should gather information to learn world models that
support many downstream tasks and inferences, such as predicting unobserved
states, estimating near- and far-term consequences of actions, planning action
sequences, and detecting changes in dynamics. Current methods for learning and
evaluating world models diverge from this goal: training and evaluation are
anchored to next-frame prediction, and success is scored by reward maximization
in the same environment. We propose WorldTest, a protocol to evaluate
model-learning agents that separates reward-free interaction from a scored test
phase in a different but related environment. WorldTest is open-ended (models
should support many different tasks unknown ahead of time) and agnostic to
model representation, allowing comparison across approaches. We instantiated
WorldTest with AutumnBench, a
suite of 43 interactive grid-world environments and 129 tasks across three
families: masked-frame prediction, planning, and predicting changes to the
causal dynamics. We compared 517 human participants and three frontier models
on AutumnBench. We found that humans outperform the models, and that scaling
compute improves performance in some environments but not others. WorldTest
provides a novel template (reward-free exploration, derived tests, and
behavior-based scoring) for evaluating what agents learn about environment
dynamics, and AutumnBench exposes significant headroom in world-model
learning.
Authors (11)
Archana Warrier
Dat Nguyen
Michelangelo Naim
Moksh Jain
Yichao Liang
Karen Schroeder
+5 more
Submitted
October 22, 2025
Key Contributions
This paper introduces WorldTest, a novel protocol for benchmarking world-model-learning agents. It separates reward-free interaction from a scored test phase in a different but related environment, enabling open-ended evaluation against diverse tasks unknown ahead of time and moving beyond next-frame prediction.
Business Value
Enables more reliable assessment of AI agents' understanding of their environment, crucial for developing robust autonomous systems that can adapt to new situations.