StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

📄 Abstract

World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation is common in state-of-the-art world models, which are diffusion-based, and stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, which enables a diffusion model to perform long-context tasks by integrating features from a state-space model that represents the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation task and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective at preserving both visual detail and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/.
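
The evaluation protocol described above probes whether extended rollouts reinstantiate previously seen content. A minimal sketch of one such probe is given below, assuming a PyTorch world model with a hypothetical predict(context, action) interface; the revisit bookkeeping and PSNR scoring are illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative memory probe (assumptions, not the paper's exact protocol):
# roll out action-conditioned predictions and score only the steps at which
# the trajectory revisits content observed earlier in the episode.
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Peak signal-to-noise ratio for frames assumed to lie in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / (mse + 1e-8))

@torch.no_grad()
def memory_probe(world_model, frames, actions, revisit_steps):
    """frames: (T+1, C, H, W) ground truth; actions: length-T sequence;
    revisit_steps: indices where the agent returns to previously seen content."""
    context = frames[:1]                                # seed with the first observed frame
    scores = []
    for t, action in enumerate(actions):
        pred = world_model.predict(context, action)     # hypothetical interface, returns (C, H, W)
        if t in revisit_steps:                          # consistency is only measurable at revisits
            scores.append(psnr(pred, frames[t + 1]))
        context = torch.cat([context, pred.unsqueeze(0)], dim=0)  # autoregressive rollout
    return torch.stack(scores).mean()                   # higher = better long-term memory
```

A model without a lasting environment state will score well early in the rollout but degrade sharply at revisit steps, which is the drift the abstract describes.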
Authors (6)
Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool

Submitted
May 28, 2025

arXiv Category
cs.CV

Key Contributions

Introduces StateSpaceDiffuser, a novel approach that integrates features from state-space models into diffusion models to enable long-context visual prediction. This design restores long-term memory and temporal coherence while preserving the high-fidelity synthesis capabilities of diffusion models.
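
A minimal sketch of how such conditioning might look in PyTorch is shown below. The module names (HistorySSM, ConditionedDenoiser), the diagonal linear recurrence, and all dimensions are illustrative assumptions, not the authors' implementation; a real system would likely use a parallel-scan SSM (e.g., Mamba-style) and a U-Net or transformer denoiser.

```python
# Illustrative sketch of SSM-conditioned diffusion denoising (not the authors' code).
import torch
import torch.nn as nn

class HistorySSM(nn.Module):
    """Summarizes the full (observation, action) history into one feature vector
    via a simple diagonal linear recurrence: s_t = a * s_{t-1} + x_t."""
    def __init__(self, in_dim: int, state_dim: int = 256):
        super().__init__()
        self.in_proj = nn.Linear(in_dim, state_dim)
        self.log_decay = nn.Parameter(torch.zeros(state_dim))  # learnable per-channel decay
        self.out_proj = nn.Linear(state_dim, state_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, in_dim), one token per past (frame, action) pair
        x = self.in_proj(tokens)
        a = torch.sigmoid(self.log_decay)                # decay in (0, 1) keeps the scan stable
        s = x.new_zeros(x.shape[0], x.shape[2])
        for t in range(x.shape[1]):                      # sequential scan; parallelizable in practice
            s = a * s + x[:, t]
        return self.out_proj(s)                          # (B, state_dim) long-term memory feature

class ConditionedDenoiser(nn.Module):
    """Toy denoiser that sees the noisy frame, a recent frame (short context),
    the SSM memory feature, and the diffusion timestep."""
    def __init__(self, frame_dim: int, state_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim + state_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, noisy, recent, memory, t):
        # noisy, recent: (B, frame_dim); memory: (B, state_dim); t: (B, 1)
        return self.net(torch.cat([noisy, recent, memory, t], dim=-1))

# Usage: a long history is compressed into a memory feature that conditions
# every denoising step, alongside the usual short window of recent frames.
ssm, denoiser = HistorySSM(in_dim=64), ConditionedDenoiser(frame_dim=64)
memory = ssm(torch.randn(2, 500, 64))                    # 500 history steps, far beyond a frame window
eps_hat = denoiser(torch.randn(2, 64), torch.randn(2, 64), memory, torch.rand(2, 1))
```

The key design point is that the denoiser's conditioning cost stays constant regardless of episode length, since the history is folded into a fixed-size state rather than attended to frame by frame.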

Business Value

Enables more realistic and predictable simulations for training autonomous agents (e.g., robots, self-driving cars) and for creating immersive virtual environments, reducing the need for extensive real-world data collection.