Semantic World Models

📄 Abstract

Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. To make such predictions, the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision-language models. Thus vision-language models can be trained as "semantic" world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties of the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at https://weirdlabuw.github.io/swm.
Authors (5)
Jacob Berg
Chuning Zhu
Yanda Bao
Ishan Durugkar
Abhishek Gupta
Submitted
October 22, 2025
arXiv Category
cs.LG

Key Contributions

This paper proposes framing world modeling as a visual question answering problem about semantic information in future frames, rather than predicting future pixels. This allows vision-language models to be trained as 'semantic' world models through supervised finetuning on image-action-text data, enabling better planning for decision-making.
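
To make the idea concrete, here is a minimal sketch of how a planner might use a model that answers questions about the future instead of predicting pixels. Everything in it is an illustrative assumption rather than the paper's implementation: `toy_answer_prob` is a hand-written stand-in for a finetuned vision-language model, the observation is reduced to a 2-D position, `FutureQuery`, `score`, and `plan` are hypothetical names, and random shooting is used only as a simple example of a sampling-based planner.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Obs = Tuple[float, float]     # assumption: observation reduced to a 2-D position
Action = Tuple[float, float]  # assumption: action is a 2-D displacement


@dataclass
class FutureQuery:
    """A question about the future plus the answer the task wants."""
    question: str
    desired_answer: str


# Hypothetical interface: P(answer | observation, action sequence, question).
AnswerProb = Callable[[Obs, Sequence[Action], str, str], float]


def toy_answer_prob(obs: Obs, actions: Sequence[Action],
                    question: str, answer: str) -> float:
    """Stand-in for a finetuned VLM (NOT a real model).

    Rolls the position forward and makes "yes" more likely the closer
    the final position is to a fixed goal at (1, 1). The question text
    is ignored in this toy version.
    """
    x, y = obs
    for dx, dy in actions:
        x, y = x + dx, y + dy
    dist = ((x - 1.0) ** 2 + (y - 1.0) ** 2) ** 0.5
    p_yes = 1.0 / (1.0 + dist)
    return p_yes if answer == "yes" else 1.0 - p_yes


def score(model: AnswerProb, obs: Obs, actions: Sequence[Action],
          queries: Sequence[FutureQuery]) -> float:
    """Average probability of the desired answers over all queries."""
    total = sum(model(obs, actions, q.question, q.desired_answer)
                for q in queries)
    return total / len(queries)


def plan(model: AnswerProb, obs: Obs, queries: Sequence[FutureQuery],
         horizon: int = 3, num_candidates: int = 256,
         seed: int = 0) -> Tuple[List[Action], float]:
    """Random shooting: sample action sequences, keep the best-scoring one."""
    rng = random.Random(seed)
    best_actions: List[Action] = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        candidate = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
                     for _ in range(horizon)]
        s = score(model, obs, candidate, queries)
        if s > best_score:
            best_actions, best_score = candidate, s
    return best_actions, best_score


if __name__ == "__main__":
    queries = [FutureQuery("Is the gripper at the goal?", "yes")]
    actions, value = plan(toy_answer_prob, (0.0, 0.0), queries)
    print(actions, value)
```

The point of the sketch is that the planner never needs a reconstructed future image: it only compares candidate action sequences by how likely the model considers the desired answers. In a real system the stand-in function would be replaced by a vision-language model finetuned on image-action-text data, and the sampling scheme could be any planner that accepts a scalar score.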

Business Value

Semantic world models could lead to robots that plan and adapt more effectively in dynamic environments, which may improve efficiency and safety in applications such as logistics, manufacturing, and exploration.