Abstract
Diffusion models have shown strong capabilities in generating high-quality
images from text prompts. However, these models often require large-scale
training data and significant computational resources to train, or suffer from
heavyweight architectures with high latency. To this end, we propose Efficient Multimodal
Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal
diffusion model with only 304M parameters for fast image synthesis requiring
low training resources. We provide an easily reproducible baseline with
competitive results. Our model for 512px generation, trained on only 25M
public images in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on
GenEval and easily reaches 0.72 with post-training techniques such as
GRPO. Our design philosophy centers on token reduction as the computational
cost scales significantly with the token count. We adopt a highly compressive
visual tokenizer to produce a compact latent representation and propose a novel
multi-path compression module that further reduces the token count, as sketched below.
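The abstract does not spell out the multi-path design, so the following is only
a minimal sketch of the general idea: several parallel downsampling paths (here
average pooling, a strided convolution, and pixel-unshuffle, all illustrative
assumptions rather than the paper's exact layout) merged into one compressed
token grid.

```python
# Hypothetical sketch of a multi-path token compression module. The actual
# E-MMDiT design is not detailed in the abstract; path choices and the
# sum-based fusion below are assumptions for illustration only.
import torch
import torch.nn as nn


class MultiPathCompression(nn.Module):
    """Compress an H x W token grid to (H/2) x (W/2) via parallel paths."""

    def __init__(self, dim: int):
        super().__init__()
        # Path 1: average pooling followed by a 1x1 projection.
        self.pool = nn.AvgPool2d(kernel_size=2)
        self.pool_proj = nn.Conv2d(dim, dim, kernel_size=1)
        # Path 2: strided convolution (learned downsampling).
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)
        # Path 3: pixel-unshuffle packs each 2x2 neighborhood into channels,
        # then a 1x1 conv projects back to the model width.
        self.unshuffle = nn.PixelUnshuffle(downscale_factor=2)
        self.unshuffle_proj = nn.Conv2d(dim * 4, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, dim, H, W) token grid; H and W assumed even.
        return (
            self.pool_proj(self.pool(x))
            + self.conv(x)
            + self.unshuffle_proj(self.unshuffle(x))
        )


tokens = torch.randn(2, 64, 16, 16)    # toy grid: 256 tokens per image
compressed = MultiPathCompression(64)(tokens)
print(compressed.shape)                # torch.Size([2, 64, 8, 8]) -> 64 tokens
```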
To enhance our design, we introduce Position Reinforcement, which strengthens
positional information to maintain spatial coherence, and Alternating Subregion
Attention (ASA), which performs attention within subregions to further reduce
computational cost.
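ASA itself is only named in the abstract; as a hedged illustration, the sketch
below partitions the token grid into non-overlapping subregions, runs full
attention inside each, and alternates the partition offset between consecutive
blocks (in the spirit of shifted-window attention) so information can cross
region boundaries over depth. The region size and the shift scheme are
assumptions.

```python
# Hypothetical sketch of Alternating Subregion Attention (ASA). The exact
# partition and alternation scheme are assumptions; only "attention within
# subregions" is stated in the abstract.
import torch
import torch.nn as nn


class AlternatingSubregionAttention(nn.Module):
    """Full attention restricted to r x r subregions of the token grid."""

    def __init__(self, dim: int, num_heads: int, region: int = 4):
        super().__init__()
        self.region = region
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, shift: bool) -> torch.Tensor:
        # x: (B, H, W, C); H and W assumed divisible by the region size.
        B, H, W, C = x.shape
        r = self.region
        if shift:
            # Alternate blocks offset the partition so border tokens can
            # attend across region boundaries (assumed mechanism).
            x = torch.roll(x, shifts=(-r // 2, -r // 2), dims=(1, 2))
        # Partition into (H/r)*(W/r) independent subregions of r*r tokens.
        x = x.reshape(B, H // r, r, W // r, r, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, r * r, C)
        x, _ = self.attn(x, x, x)   # attention stays inside each subregion
        # Reverse the partition (and the offset, if applied).
        x = x.reshape(B, H // r, W // r, r, r, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if shift:
            x = torch.roll(x, shifts=(r // 2, r // 2), dims=(1, 2))
        return x


x = torch.randn(2, 16, 16, 64)
asa = AlternatingSubregionAttention(dim=64, num_heads=4)
out = asa(asa(x, shift=False), shift=True)   # two consecutive blocks
print(out.shape)                             # torch.Size([2, 16, 16, 64])
```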
In addition, we propose AdaLN-affine, an efficient lightweight module for
computing the modulation parameters in transformer blocks.
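The exact AdaLN-affine formulation is not given here; one plausible reading,
sketched below under that assumption, is to project the conditioning signal to
DiT-style modulation parameters once, and let each block apply only a cheap
learned per-channel affine transform to that shared base instead of running a
full per-block projection.

```python
# Hypothetical sketch of AdaLN-affine. All class names and the shared-base
# scheme are illustrative assumptions, not the paper's confirmed design.
import torch
import torch.nn as nn


class SharedAdaLNBase(nn.Module):
    """Projects the conditioning vector to modulation space once per step."""

    def __init__(self, cond_dim: int, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * dim))

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.proj(cond)      # (B, 6 * dim), shared by every block


class AdaLNAffine(nn.Module):
    """Per-block learned affine over the shared base modulation.

    Costs 12 * dim parameters per block versus roughly 6 * dim * dim
    for a per-block AdaLN projection.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(6 * dim))
        self.shift = nn.Parameter(torch.zeros(6 * dim))

    def forward(self, base: torch.Tensor):
        mod = base * self.scale + self.shift
        # shift/scale/gate for the attention and MLP branches (DiT-style).
        return mod.chunk(6, dim=-1)


cond = torch.randn(2, 256)             # e.g., timestep + pooled text embedding
base = SharedAdaLNBase(256, 64)(cond)  # computed once, reused across blocks
shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = AdaLNAffine(64)(base)
print(gate_m.shape)                    # torch.Size([2, 64])
```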
Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT
serves as a strong and practical baseline for future research and contributes
to the democratization of generative AI models.
Authors (5)
Tong Shen
Jingai Yu
Dong Zhou
Dong Li
Emad Barsoum
Submitted
October 31, 2025
Key Contributions
E-MMDiT is an efficient and lightweight multimodal diffusion model designed for fast image synthesis with low training resource requirements. It achieves this through aggressive token reduction, pairing a highly compressive visual tokenizer with an optimized transformer architecture, and delivers competitive results with only 304M parameters.
Business Value
Democratizes access to high-quality image generation capabilities by reducing the hardware and data requirements, enabling smaller teams and individuals to leverage advanced AI.