📄 Abstract
We introduce Emu3.5, a large-scale multimodal world model that natively
predicts the next state across vision and language. Emu3.5 is pre-trained
end-to-end with a unified next-token prediction objective on a corpus of
vision-language interleaved data containing over 10 trillion tokens, primarily
derived from sequential frames and transcripts of internet videos. The model
naturally accepts interleaved vision-language inputs and generates interleaved
vision-language outputs. Emu3.5 is further post-trained with large-scale
reinforcement learning to enhance multimodal reasoning and generation. To
improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA),
which converts token-by-token decoding into bidirectional parallel prediction,
accelerating per-image inference by about 20x without sacrificing performance.
Emu3.5 exhibits strong native multimodal capabilities, including long-horizon
vision-language generation, any-to-image (X2I) generation, and complex
text-rich image generation. It also exhibits generalizable world-modeling
abilities, enabling spatiotemporally consistent world exploration and
open-world embodied manipulation across diverse scenarios and tasks. For
comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image
(Nano Banana) on image generation and editing tasks and demonstrates superior
results on a suite of interleaved generation tasks. We open-source Emu3.5 at
https://github.com/baaivision/Emu3.5 to support community research.
Authors (23)
Yufeng Cui
Honghao Chen
Haoge Deng
Xu Huang
Xinghang Li
Jirong Liu
+17 more
Submitted
October 30, 2025
Key Contributions
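Introduces Emu3.5, a large-scale multimodal world model pre-trained end-to-end with a unified next-token prediction objective on more than 10 trillion tokens of interleaved vision-language data. Post-training with large-scale reinforcement learning strengthens multimodal reasoning and generation, and Discrete Diffusion Adaptation (DiDA) replaces token-by-token decoding with bidirectional parallel prediction, speeding up per-image inference by about 20x without sacrificing performance. The resulting model shows strong native multimodal capabilities, including long-horizon vision-language generation and any-to-image (X2I) generation.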
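To make the contrast between token-by-token decoding and parallel prediction concrete, here is a toy sketch that only counts forward passes. It is not the DiDA algorithm, whose details are not given on this page. The even unmasking schedule, the left-to-right commit order, the `toy_predict` stand-in model, and the figures of 1024 tokens and 16 steps are all assumptions made for illustration.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding needs one forward pass per generated token."""
    return num_tokens


def parallel_unmask_schedule(num_tokens: int, num_steps: int) -> list[int]:
    """How many tokens to commit at each step (even split; a hypothetical schedule)."""
    base, extra = divmod(num_tokens, num_steps)
    return [base + (1 if i < extra else 0) for i in range(num_steps)]


def parallel_decode(num_tokens: int, num_steps: int, predict):
    """Start fully masked; each step predicts every position in one pass and commits a chunk."""
    tokens = [MASK] * num_tokens
    passes = 0
    for n_commit in parallel_unmask_schedule(num_tokens, num_steps):
        proposals = predict(tokens)  # one pass over the whole sequence (bidirectional context)
        passes += 1
        still_masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Simplification: commit the left-most masked positions. Mask-predict style
        # decoders usually commit the most confident predictions instead.
        for i in still_masked[:n_commit]:
            tokens[i] = proposals[i]
    return tokens, passes


def toy_predict(tokens):
    """Hypothetical stand-in for a model: proposes a token id for every position."""
    return [i % 7 for i in range(len(tokens))]


if __name__ == "__main__":
    n, k = 1024, 16  # assumed tokens per image and refinement steps
    decoded, passes = parallel_decode(n, k, toy_predict)
    print("autoregressive passes:", autoregressive_passes(n))
    print("parallel passes:", passes)
```

The forward-pass count is only a proxy. Each parallel pass processes the full sequence, so the ratio of passes in this toy does not translate directly into wall-clock speedup and should not be read as reproducing the reported 20x figure.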
Business Value
Enables faster and more sophisticated AI applications that understand and generate content across vision and language, such as advanced chatbots, creative tools, and more capable embodied agents.