This paper proposes framing world modeling as a visual question answering problem about semantic information in future frames, rather than predicting future pixels. This allows vision-language models to be trained as 'semantic' world models through supervised finetuning on image-action-text data, enabling better planning for decision-making.
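As a rough illustration of this data framing (a minimal sketch, not the authors' exact pipeline), a single training example might pair a current observation and a candidate action with a question and answer about the resulting future state. The class and field names below are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class SemanticWorldModelExample:
    """One supervised finetuning example: the model answers a question
    about the *future* frame, conditioned on the current frame and a
    candidate action -- no pixel prediction involved."""
    current_frame: bytes  # encoded image observation
    action: str           # candidate action, e.g. a language command
    question: str         # query about the semantic outcome
    answer: str           # ground-truth label derived from the future frame

def to_vlm_prompt(ex: SemanticWorldModelExample) -> dict:
    """Format the example as an (image, prompt, target) triple for a
    generic VLM finetuning pipeline. Keys here are hypothetical."""
    prompt = (
        f"Action: {ex.action}\n"
        f"After this action is executed, {ex.question}"
    )
    return {"image": ex.current_frame, "prompt": prompt, "target": ex.answer}

# Hypothetical usage:
example = SemanticWorldModelExample(
    current_frame=b"<jpeg bytes>",
    action="push the red block toward the bowl",
    question="is the red block inside the bowl?",
    answer="yes",
)
print(to_vlm_prompt(example)["prompt"])
```

Framed this way, a planner could score candidate actions by querying the finetuned model with outcome questions and comparing its answers against the desired goal state.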
This approach could lead to more intelligent and adaptable robots capable of complex planning and decision-making in dynamic environments, improving efficiency and safety in applications such as logistics, manufacturing, and exploration.