
E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

📄 Abstract

Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structures with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis that requires low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data samples in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches 0.72 with post-training techniques such as GRPO. Our design philosophy centers on token reduction, as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further token compression. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to the democratization of generative AI models.
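
The abstract describes Alternating Subregion Attention (ASA) only at a high level, so the PyTorch sketch below illustrates one plausible form of the underlying idea: restricting self-attention to non-overlapping subregions of the visual token grid, with the partition alternated (here via a shift) across blocks so information can still propagate globally. The class name, the shift-based alternation, and all shapes are assumptions made for illustration, not the paper's actual implementation; the released code at https://github.com/AMD-AGI/Nitro-E is the authoritative reference.

```python
# Illustrative sketch of subregion (windowed) attention; NOT the exact E-MMDiT ASA design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubregionAttention(nn.Module):
    """Attention restricted to non-overlapping subregions of an H x W token grid.
    `shift` lets consecutive blocks alternate the partition so tokens near region
    borders still exchange information across blocks. Assumes `region` divides H and W."""
    def __init__(self, dim, num_heads, region, shift=0):
        super().__init__()
        self.num_heads, self.region, self.shift = num_heads, region, shift
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                      # N == H * W visual tokens
        r, h = self.region, self.num_heads
        x = x.view(B, H, W, C)
        if self.shift:                         # alternate the partition between blocks
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        # group tokens into (H/r * W/r) subregions of r*r tokens each
        x = x.view(B, H // r, r, W // r, r, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, r * r, C)
        q, k, v = self.qkv(x).view(-1, r * r, 3, h, C // h).permute(2, 0, 3, 1, 4)
        out = F.scaled_dot_product_attention(q, k, v)   # attention cost depends on r*r, not H*W
        out = out.transpose(1, 2).reshape(B, H // r, W // r, r, r, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift:                         # undo the shift
            out = torch.roll(out, (self.shift, self.shift), dims=(1, 2))
        return self.proj(out.reshape(B, N, C))

# usage: 32x32 token grid, 8x8 subregions
# attn = SubregionAttention(dim=256, num_heads=8, region=8, shift=4)
# y = attn(torch.randn(2, 32 * 32, 256), H=32, W=32)
```

Because each attention call touches only region² tokens, its cost no longer grows with the full grid size; alternating or shifting the partition between consecutive blocks is a common way to keep subregions from becoming isolated from one another.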
Authors (5)
Tong Shen
Jingai Yu
Dong Zhou
Dong Li
Emad Barsoum
Submitted
October 31, 2025
arXiv Category
cs.CV

Key Contributions

E-MMDiT is an efficient, lightweight multimodal diffusion model (304M parameters) designed for fast image synthesis under low training-resource budgets. It achieves this through aggressive token reduction via a highly compressive visual tokenizer and an optimized transformer architecture (multi-path compression, Position Reinforcement, Alternating Subregion Attention, and AdaLN-affine), delivering competitive results with far fewer parameters; the calculation below illustrates why token reduction pays off.
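
To make the token-reduction argument concrete, the small calculation below (using hypothetical tokenizer downsampling factors, not the settings reported in the paper) shows how quickly the roughly quadratic self-attention cost shrinks as the visual token count drops for a 512px image.

```python
# Back-of-the-envelope arithmetic with hypothetical downsampling factors:
# self-attention cost grows roughly quadratically with the number of visual
# tokens, so a more compressive tokenizer pays off disproportionately.
def visual_tokens(image_px: int, downsample: int) -> int:
    side = image_px // downsample
    return side * side

for f in (8, 16, 32):                        # hypothetical tokenizer downsampling factors
    n = visual_tokens(512, f)
    print(f"factor {f:>2}: {n:>4} tokens, ~{n * n / 1e6:.2f}M attention pairs")
# factor  8: 4096 tokens, ~16.78M attention pairs
# factor 16: 1024 tokens, ~1.05M attention pairs
# factor 32:  256 tokens, ~0.07M attention pairs
```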

Business Value

Democratizes access to high-quality image generation capabilities by reducing the hardware and data requirements, enabling smaller teams and individuals to leverage advanced AI.