This paper proposes framing world modeling as a visual question answering problem about semantic information in future frames, rather than predicting future pixels. This allows vision-language models to be trained as 'semantic' world models through supervised finetuning on image-action-text data, enabling better planning for decision-making.
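As a rough illustration of this data framing (a minimal sketch, not the authors' exact pipeline), a single training example might pair a current observation and a candidate action with a question and answer about the resulting future state. The class and field names below are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class SemanticWorldModelExample:
    """One supervised finetuning example: the model answers a question
    about the *future* frame, conditioned on the current frame and a
    candidate action -- no pixel prediction involved."""
    current_frame: bytes  # encoded image observation
    action: str           # candidate action, e.g. a language command
    question: str         # query about the semantic outcome
    answer: str           # ground-truth label derived from the future frame

def to_vlm_prompt(ex: SemanticWorldModelExample) -> dict:
    """Format the example as an (image, prompt, target) triple for a
    generic VLM finetuning pipeline. Keys here are hypothetical."""
    prompt = (
        f"Action: {ex.action}\n"
        f"After this action is executed, {ex.question}"
    )
    return {"image": ex.current_frame, "prompt": prompt, "target": ex.answer}

# Hypothetical usage:
example = SemanticWorldModelExample(
    current_frame=b"<jpeg bytes>",
    action="push the red block toward the bowl",
    question="is the red block inside the bowl?",
    answer="yes",
)
print(to_vlm_prompt(example)["prompt"])
```

Framed this way, a planner could score candidate actions by querying the finetuned model with outcome questions and comparing its answers against the desired goal state.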
This approach could lead to more intelligent and adaptable robots capable of complex planning and decision-making in dynamic environments, improving efficiency and safety in applications such as logistics, manufacturing, and exploration.