Abstract
World models have recently gained prominence for action-conditioned visual
prediction in complex environments. However, relying on only a few recent
observations causes them to lose long-term context. Consequently, within a few
steps, the generated scenes drift from what was previously observed,
undermining temporal coherence. This limitation, common in state-of-the-art
diffusion-based world models, stems from the lack of a persistent environment
state. To address it, we introduce StateSpaceDiffuser, which enables a
diffusion model to perform long-context tasks by conditioning it on features
from a state-space model that represent the entire interaction history.
This design restores long-term memory while preserving the high-fidelity
synthesis of diffusion models. To rigorously measure temporal consistency, we
develop an evaluation protocol that probes a model's ability to reinstantiate
seen content in extended rollouts. Comprehensive experiments show that
StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline,
maintaining a coherent visual context for an order of magnitude more steps. It
delivers consistent views in both a 2D maze-navigation task and a complex 3D
environment. These results establish that bringing state-space representations
into diffusion models is an effective way to combine high-fidelity visual
detail with long-term memory. Project page:
https://insait-institute.github.io/StateSpaceDiffuser/.
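
As a purely illustrative sketch (not the authors' code), the core idea can be wired up as follows in PyTorch: a toy linear state-space model compresses the full frame/action history into a fixed-size feature, which then conditions a (here drastically simplified) diffusion denoiser. The module names, dimensions, and concatenation-based conditioning are our assumptions; in the real system the denoiser would be a video diffusion network and the SSM a modern S4/Mamba-style backbone.

    # Minimal, hypothetical sketch: conditioning a diffusion denoiser on a
    # state-space summary of the entire interaction history. All names,
    # sizes, and the conditioning scheme are assumptions, not the paper's code.

    import torch
    import torch.nn as nn


    class DiagonalSSM(nn.Module):
        """Toy linear SSM: h_t = a * h_{t-1} + B x_t; returns C h_T.

        Stands in for a modern S4/Mamba-style block; the paper's actual
        state-space architecture may differ.
        """

        def __init__(self, in_dim: int, state_dim: int):
            super().__init__()
            self.a_logit = nn.Parameter(torch.zeros(state_dim))
            self.B = nn.Linear(in_dim, state_dim, bias=False)
            self.C = nn.Linear(state_dim, state_dim, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, in_dim) -> history feature: (batch, state_dim)
            a = torch.sigmoid(self.a_logit)  # keep the recurrence stable in (0, 1)
            h = torch.zeros(x.size(0), a.numel(), device=x.device)
            for t in range(x.size(1)):  # sequential scan over the whole history
                h = a * h + self.B(x[:, t])
            return self.C(h)


    class ConditionedDenoiser(nn.Module):
        """Tiny stand-in for a diffusion UNet, conditioned on the SSM feature."""

        def __init__(self, frame_dim: int, state_dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(frame_dim + state_dim + 1, 256),
                nn.SiLU(),
                nn.Linear(256, frame_dim),
            )

        def forward(self, noisy_frame, t, history_feat):
            # Concatenation is one simple conditioning choice; cross-attention
            # over the history feature is another plausible option.
            inp = torch.cat([noisy_frame, history_feat, t[:, None]], dim=-1)
            return self.net(inp)  # predicted noise

    if __name__ == "__main__":
        B, T, frame_dim, state_dim = 4, 64, 128, 64
        ssm = DiagonalSSM(frame_dim, state_dim)
        denoiser = ConditionedDenoiser(frame_dim, state_dim)
        history = torch.randn(B, T, frame_dim)  # encoded past frames/actions
        noisy = torch.randn(B, frame_dim)       # current noisy latent
        t = torch.rand(B)                       # diffusion timestep in [0, 1]
        eps_hat = denoiser(noisy, t, ssm(history))
        print(eps_hat.shape)  # torch.Size([4, 128])

The key point the sketch shows is where the history enters: because the SSM state is a fixed-size summary of all past steps, the denoiser's conditioning cost does not grow with rollout length, unlike conditioning on a window of recent frames.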
Authors (6)
Nedko Savov
Naser Kazemi
Deheng Zhang
Danda Pani Paudel
Xi Wang
Luc Van Gool
Key Contributions
Introduces StateSpaceDiffuser, a novel approach that integrates features from state-space models into diffusion models to enable long-context visual prediction. This design restores long-term memory and temporal coherence while preserving the high-fidelity synthesis capabilities of diffusion models.
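
The abstract also describes an evaluation protocol that probes whether a model can reinstantiate previously seen content during long rollouts. Below is a hedged sketch of what such a memory probe could look like; the pairing of rollout steps to reference frames and the use of PSNR are our assumptions, not the paper's exact protocol.

    # Hypothetical consistency probe: frames generated when the agent revisits
    # a previously observed viewpoint are compared against the frames it
    # originally saw there. Pairing scheme and metric are assumptions.

    import numpy as np


    def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
        mse = np.mean((a - b) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)


    def memory_consistency(rollout, references, revisit_pairs):
        """Score how well a rollout reinstantiates previously observed content.

        rollout:       list of generated frames (H, W, C) in [0, 1]
        references:    list of ground-truth frames observed earlier
        revisit_pairs: list of (rollout_step, reference_index) where the
                       agent returns to a previously seen view
        """
        return [
            (step, psnr(rollout[step], references[ref]))
            for step, ref in revisit_pairs
        ]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        refs = [rng.random((64, 64, 3)) for _ in range(4)]
        # A model with memory should reproduce refs at revisit steps; here we
        # fake a rollout that drifts slightly from the references.
        rollout = [r + 0.05 * rng.standard_normal(r.shape) for r in refs]
        for step, score in memory_consistency(rollout, refs, [(0, 0), (3, 3)]):
            print(f"step {step}: PSNR = {score:.2f} dB")

Tracking such a score as a function of rollout step is one way to quantify the abstract's claim of maintaining a coherent visual context "for an order of magnitude more steps" than a diffusion-only baseline.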
Business Value
Enables more realistic and predictable simulations for training autonomous agents (e.g., robots, self-driving cars) and for creating immersive virtual environments, reducing the need for extensive real-world data collection.