Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model
capacity while preserving computational efficiency. Despite its notable success
in large language models (LLMs), existing attempts to apply MoE to Diffusion
Transformers (DiTs) have yielded limited gains. We attribute this gap to
fundamental differences between language and visual tokens. Language tokens are
semantically dense with pronounced inter-token variation, while visual tokens
exhibit spatial redundancy and functional heterogeneity, hindering expert
specialization in vision MoE. To this end, we present ProMoE, an MoE framework
featuring a two-step router with explicit routing guidance that promotes expert
specialization. Specifically, this guidance encourages the router to partition
image tokens into conditional and unconditional sets via conditional routing
according to their functional roles, and refine the assignments of conditional
image tokens through prototypical routing with learnable prototypes based on
semantic content. Moreover, the similarity-based expert allocation in latent
space enabled by prototypical routing offers a natural mechanism for
incorporating explicit semantic guidance, and we validate that such guidance is
crucial for vision MoE. Building on this, we propose a routing contrastive loss
that explicitly enhances the prototypical routing process, promoting
intra-expert coherence and inter-expert diversity. Extensive experiments on
ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods
under both Rectified Flow and DDPM training objectives. Code and models will be
made publicly available.
Authors (11)
Yujie Wei
Shiwei Zhang
Hangjie Yuan
Yujin Han
Zhekai Chen
Jiayu Wang
+5 more
Submitted
October 28, 2025
Key Contributions
ProMoE introduces an effective MoE framework for Diffusion Transformers by addressing the challenges of expert specialization with visual tokens. It employs a two-step router with explicit routing guidance (conditional and prototypical routing) to partition and assign image tokens based on their functional roles, enabling efficient scaling of DiTs.
Business Value
Enables the development of more powerful and computationally efficient generative models for vision tasks, potentially reducing training and inference costs for high-resolution image and video generation.