📄 Abstract
Adapting pre-trained video generation models into controllable world models
via latent actions is a promising step towards creating generalist world
models. The dominant paradigm adopts a two-stage approach that trains the latent
action model (LAM) and the world model separately, resulting in redundant
training and limiting their potential for co-adaptation. A conceptually simple
and appealing idea is to directly replace the forward dynamics model in the LAM with
a powerful world model and train the two jointly, but doing so is non-trivial and
prone to representational collapse. In this work, we propose CoLA-World, which
for the first time successfully realizes this synergistic paradigm, resolving
the core challenge in joint learning through a critical warm-up phase that
effectively aligns the representations of the from-scratch LAM with the
pre-trained world model. This unlocks a co-evolution cycle: the world model
acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM,
while the LAM offers a more precise and adaptable control interface to the
world model. Empirically, CoLA-World matches or outperforms prior two-stage
methods in both video simulation quality and downstream visual planning,
establishing a robust and efficient new paradigm for the field.
Authors (6)
Yucen Wang
Fengming Zhang
De-Chuan Zhan
Li Zhao
Kaixin Wang
Jiang Bian
Submitted
October 30, 2025
Key Contributions
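CoLA-World enables joint training of the latent action model (LAM) and the world model, avoiding both the redundant training of two-stage approaches and the representational collapse that naive joint training is prone to. It introduces a critical warm-up phase that aligns the representations of the from-scratch LAM with the pre-trained world model. After warm-up, the two co-evolve: the world model acts as a tutor, providing gradients that shape the LAM, while the LAM gives the world model a more precise and adaptable control interface.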
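The warm-up-then-joint schedule described above can be pictured with a minimal Python sketch. The abstract does not say what is frozen or updated during warm-up, so the choice here (update only the LAM while the pre-trained world model stays fixed) is an assumption made for illustration, not the authors' procedure. The names `PhaseConfig`, `trainable_modules`, `"lam"`, and `"world_model"` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Hypothetical schedule settings; the values are not taken from the paper."""
    warmup_steps: int  # steps during which the world model is assumed frozen


def trainable_modules(step: int, cfg: PhaseConfig) -> set[str]:
    """Return the names of the modules that receive gradient updates at `step`."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < cfg.warmup_steps:
        # Warm-up (assumed): only the from-scratch LAM adapts, so its
        # representations can align with the pre-trained world model.
        return {"lam"}
    # Joint phase: the LAM and the world model are updated together.
    return {"lam", "world_model"}
```

In a real training loop, the returned set would decide which parameter groups the optimizer steps at each iteration; how long the warm-up should last is left open here.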
Business Value
Could support the development of more realistic and controllable simulation environments for training AI agents (e.g., robots, autonomous vehicles), which may reduce the need for expensive real-world data collection and testing.