Research Paper | Audience: AI researchers, ML engineers, developers of generative AI applications

Emu3.5: Native Multimodal Models are World Learners


📄 Abstract

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
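The "unified next-token prediction objective" over interleaved data can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the vocabulary sizes, the `interleave` and `next_token_loss` helpers, and the uniform stand-in model are hypothetical and are not taken from the Emu3.5 code base. The sketch only shows the general idea that text and visual tokens can share one id space and be scored by a single cross-entropy loss.

```python
import math

# Toy sizes chosen for illustration only; not Emu3.5's real vocabulary.
TEXT_VOCAB = 8
VISUAL_VOCAB = 4
VOCAB = TEXT_VOCAB + VISUAL_VOCAB


def interleave(segments):
    """Flatten [("text", ids), ("image", ids), ...] into one id sequence.

    Image ids are offset by TEXT_VOCAB so both modalities live in a
    single shared vocabulary (an assumed layout).
    """
    seq = []
    for kind, ids in segments:
        offset = TEXT_VOCAB if kind == "image" else 0
        seq.extend(i + offset for i in ids)
    return seq


def next_token_loss(seq, logits_fn):
    """Mean cross-entropy of predicting seq[t + 1] from the prefix seq[:t + 1]."""
    total = 0.0
    for t in range(len(seq) - 1):
        logits = logits_fn(seq[: t + 1])
        log_z = math.log(sum(math.exp(x) for x in logits))
        total += log_z - logits[seq[t + 1]]
    return total / (len(seq) - 1)


def uniform_logits(prefix):
    """Hypothetical stand-in for a model: equal score for every token."""
    return [0.0] * VOCAB


seq = interleave([("text", [1, 2]), ("image", [0, 3]), ("text", [5])])
loss = next_token_loss(seq, uniform_logits)
```

With the uniform stand-in, the loss equals `log(VOCAB)`, which is the baseline any trained model would need to beat. The same loss function applies whether the next token is text or visual, which is the sense in which the objective is unified.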

Authors (23)

Yufeng Cui

Honghao Chen

Haoge Deng

Xu Huang

Xinghang Li

Jirong Liu

+17 more

Submitted

October 30, 2025

arXiv Category

cs.CV


Key Contributions

Introduces Emu3.5, a large-scale multimodal world model trained end-to-end on over 10 trillion tokens of interleaved vision-language data for next-state prediction. It enhances multimodal reasoning via RL and proposes Discrete Diffusion Adaptation (DiDA) to accelerate inference by ~20x without performance loss, enabling strong native multimodal capabilities like long-horizon generation and X2I.
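To make the contrast between token-by-token decoding and parallel prediction concrete, here is a minimal sketch of generic masked parallel decoding. It is not the paper's DiDA procedure, whose details are not given in this summary; `toy_predict`, the confidence rule, and the `per_step` schedule are all hypothetical. The point is only that committing several positions per forward pass reduces the number of passes.

```python
MASK = -1


def toy_predict(tokens, position):
    """Hypothetical stand-in for a model call.

    Returns (token, confidence) for one position. It is deterministic and
    ignores context so the two decoders can be compared directly.
    """
    token = (position * 7) % 16
    confidence = 1.0 / (1 + position % 5)
    return token, confidence


def decode_sequential(n):
    """Token-by-token decoding: one forward pass per generated token."""
    tokens, passes = [], 0
    for pos in range(n):
        tok, _ = toy_predict(tokens, pos)
        tokens.append(tok)
        passes += 1
    return tokens, passes


def decode_parallel(n, per_step):
    """Parallel decoding: each pass scores all masked positions at once
    and commits the `per_step` most confident predictions."""
    tokens, passes = [MASK] * n, 0
    while MASK in tokens:
        masked = [p for p in range(n) if tokens[p] == MASK]
        preds = {p: toy_predict(tokens, p) for p in masked}
        passes += 1
        keep = sorted(masked, key=lambda p: -preds[p][1])[:per_step]
        for p in keep:
            tokens[p] = preds[p][0]
    return tokens, passes


seq_tokens, seq_passes = decode_sequential(64)
par_tokens, par_passes = decode_parallel(64, per_step=16)
```

In this toy setting both decoders produce the same 64 tokens, but the sequential loop uses 64 passes while the parallel loop uses 4. Real speedups depend on per-pass cost and on how many tokens can be committed per step without hurting quality, so the ratio here should not be read as the paper's reported figure.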

Business Value

Enables faster and more sophisticated AI applications that understand and generate content across vision and language, such as advanced chatbots, creative tools, and more capable embodied agents.

Paper Metadata

Innovation Type

Model Architecture and Training Method

Deployment Feasibility

Moderate. Requires significant computational resources for training and inference due to model scale. DiDA improves deployment feasibility by reducing inference latency.

Limitations Addressed

Inefficiency in multimodal model inference; limited ability of models to learn world dynamics from sequential data; challenges in long-horizon multimodal generation and reasoning

Performance Gains

About 20x faster per-image inference (DiDA)

Technical Tags

multimodal models, world models, next-token prediction, vision-language, interleaved data, reinforcement learning, discrete diffusion adaptation, inference efficiency, long-horizon generation, X2I generation

Research Topics

Multimodal AI, World Models, Generative Models, Reinforcement Learning, Efficient Inference

Methods & Architectures

Unified next-token prediction objective, Large-scale reinforcement learning, Discrete Diffusion Adaptation (DiDA), Interleaved vision-language input/output processing, Emu3.5, Transformer-based multimodal model

Applications & Tasks

Content Generation, Human-Computer Interaction, Robotics, Creative AI, Multimodal Generation, Multimodal Reasoning, Efficient Model Inference, Predicting next state across vision and language, Long-horizon multimodal generation, Any-to-image generation, Accelerating inference

Related Fields

Generative AI, Reinforcement Learning, Computer Vision, Natural Language Processing

Keywords

Multimodal AI, World Models, LLM, Vision-Language, Next-Token Prediction, Reinforcement Learning, Diffusion Models, Inference Efficiency, Generative AI, Emu3.5, Interleaved Data, Long-Horizon Generation

Academic Context

#Multimodal AI, #World Models, #Generative Models, #Reinforcement Learning, #Efficient Inference

Commercial Potential

Potential Products

Advanced multimodal content creation tools, AI assistants with rich contextual understanding, Simulators for training agents

Target Industries

Media & Entertainment, Technology, Gaming, Robotics

Use Case Examples

Generating a story with accompanying images based on a prompt. Creating video sequences from textual descriptions. An AI agent that can understand and interact with both visual and textual information in real-time.

Competitive Edge

Positions itself as a 'world learner' by natively handling interleaved vision-language data and predicting future states, aiming for a more holistic understanding than models focused solely on discrete tasks.

Market Opportunity

Very large, tapping into the rapidly expanding generative AI and multimodal AI markets.

Revenue Models

API access, licensing to enterprises, integration into existing platforms.

Resource Requirements

Compute Needs

Extremely high for training (trillions of tokens, large model size). Inference requirements are also high but significantly reduced by DiDA.

Data Requirements

Massive, diverse dataset of interleaved vision-language data, primarily from internet videos.

Deployment Constraints

High computational cost for inference, even with DiDA; model size and memory requirements; need for specialized hardware for optimal performance

Scalability

Training is inherently massive in scale. Inference scalability is improved by DiDA but still requires substantial resources.

Regulatory Considerations

Moderate, related to data privacy and potential misuse of generative capabilities.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years, for robust, optimized deployment in products.

Patent Potential

High, particularly for the Discrete Diffusion Adaptation (DiDA) technique and the overall world model architecture.
