Research Paper. Audience: AI Researchers, Computer Vision Engineers, ML Engineers working on generative models

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance


Abstract

Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and to refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on the ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
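The abstract describes the routing contrastive loss only by its goals (intra-expert coherence and inter-expert diversity) and does not give a formula. As a rough illustration of how a loss with those goals could look, the sketch below uses a generic InfoNCE-style objective over token-to-prototype cosine similarities. This is an assumption for illustration, not the paper's loss; the function name `routing_contrastive_loss`, the `temperature` parameter, and the use of hard expert assignments are all hypothetical.

```python
import numpy as np

def routing_contrastive_loss(tokens, prototypes, assignments, temperature=0.1):
    """InfoNCE-style sketch (assumed form, not the paper's definition).

    tokens:      (N, D) token features
    prototypes:  (E, D) one learnable prototype per expert
    assignments: (N,)   integer expert id assigned to each token
    """
    # Cosine similarity between every token and every prototype.
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    logits = (t @ p.T) / temperature                      # (N, E)

    # Numerically stable log-softmax over experts.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Pull each token toward its assigned prototype (coherence) and,
    # through the softmax denominator, away from the others (diversity).
    picked = log_probs[np.arange(len(assignments)), assignments]
    return float(-picked.mean())
```

Under this assumed form, the loss is small when tokens lie close to the prototype of their assigned expert and grows when they lie closer to another expert's prototype.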

Authors (11)

Yujie Wei

Shiwei Zhang

Hangjie Yuan

Yujin Han

Zhekai Chen

Jiayu Wang

+5 more

Submitted

October 28, 2025

arXiv Category

cs.CV


Key Contributions

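ProMoE introduces an MoE framework for Diffusion Transformers that targets the difficulty of expert specialization with visual tokens. It employs a two-step router with explicit routing guidance: conditional routing partitions image tokens into conditional and unconditional sets according to their functional roles, and prototypical routing refines the assignment of conditional tokens using learnable prototypes based on semantic content. A routing contrastive loss further promotes intra-expert coherence and inter-expert diversity, enabling more efficient scaling of DiTs.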
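A minimal sketch of the two-step routing idea follows, written from the high-level description only. Several details are assumptions and not taken from the paper: that the unconditional set is given as a boolean mask, that unconditional tokens go to a single dedicated expert (indexed here as `n_experts`), that similarity is cosine similarity, and that top-k weights come from a softmax. The name `two_step_route` is hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def two_step_route(tokens, is_unconditional, prototypes, top_k=1):
    """Sketch of conditional + prototypical routing (assumed details).

    tokens:           (N, D) image token features
    is_unconditional: (N,)   bool mask for the unconditional set
    prototypes:       (E, D) one learnable prototype per conditional expert
    Returns (expert_ids, weights), each of shape (N, top_k).
    """
    n_experts = prototypes.shape[0]

    # Step 2 scores: similarity of each token to each expert prototype.
    sims = l2_normalize(tokens) @ l2_normalize(prototypes).T   # (N, E)
    expert_ids = np.argsort(-sims, axis=1)[:, :top_k]          # best first
    top_sims = np.take_along_axis(sims, expert_ids, axis=1)

    # Softmax over the selected experts gives the mixing weights.
    exp = np.exp(top_sims - top_sims.max(axis=1, keepdims=True))
    weights = exp / exp.sum(axis=1, keepdims=True)

    # Step 1 override: unconditional tokens bypass prototypical routing
    # and go to a dedicated expert with id n_experts (an assumption).
    expert_ids[is_unconditional] = n_experts
    weights[is_unconditional] = 0.0
    weights[is_unconditional, 0] = 1.0
    return expert_ids, weights
```

Computing the prototype similarities for all tokens and then overriding the unconditional ones keeps the sketch short; a real implementation would more likely skip the similarity computation for tokens that never use it.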

Business Value

Enables the development of more powerful and computationally efficient generative models for vision tasks, potentially reducing training and inference costs for high-resolution image and video generation.

Paper Metadata

Innovation Type

Algorithmic Improvement / Architecture Design

Deployment Feasibility

Moderate. Requires implementing the MoE routing mechanism within DiT architectures.

Limitations Addressed

Limited success of MoE in Diffusion Transformers compared to LLMs, attributed to visual token properties (spatial redundancy, functional heterogeneity) that hinder expert specialization.

Technical Tags

Mixture-of-Experts (MoE), Diffusion Transformers (DiTs), Expert Specialization, Routing Guidance, Conditional Routing, Prototypical Routing, Visual Tokens, Spatial Redundancy, Functional Heterogeneity, Computational Efficiency

Research Topics

Model Scaling, Efficient Deep Learning, Computer Vision, Generative Models, Transformer Architectures

Methods & Architectures

ProMoE framework, Two-step router, Explicit routing guidance, Conditional routing, Prototypical routing, Mixture-of-Experts (MoE), Diffusion Transformer (DiT)

Applications & Tasks

Image Generation, Video Generation, Conditional Generation, Computer Vision Tasks

Related Fields

Computer Vision, Generative AI, Deep Learning, Transformer Architectures, Efficient AI

Keywords

Mixture-of-Experts, Diffusion Transformers, MoE, DiT, Expert Specialization, Routing, Generative Models, Computer Vision, Efficient AI, Transformer

Academic Context

#Model Scaling, #Efficient Deep Learning, #Computer Vision, #Generative Models, #Transformer Architectures

Commercial Potential

Potential Products

More efficient generative models for image/video creation; foundation models for vision tasks

Target Industries

Media and Entertainment, Gaming, Advertising, Technology

Use Case Examples

Generating high-resolution images with fewer computational resources; training large-scale video generation models more efficiently

Competitive Edge

Provides a more effective way to scale Diffusion Transformers using MoE compared to prior attempts, by specifically addressing the unique characteristics of visual tokens.

Market Opportunity

Significant growth in the generative AI market, particularly for vision tasks.

Revenue Models

Licensing of models; cloud-based generation services.

Resource Requirements

Compute Needs

Training MoE models, especially for vision, is computationally intensive. The aim of MoE is to scale model capacity while preserving computational efficiency, since only a subset of experts is active for each token.
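The efficiency argument can be made concrete with a small piece of arithmetic: in a sparse MoE layer, stored parameters grow with the number of experts, while the parameters used per token grow only with the number of experts selected. The helper below is a generic sketch of that accounting. The function name and the example figures are hypothetical and are not taken from the paper.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, top_k):
    """Return (total, active_per_token) parameter counts for a sparse MoE.

    shared_params:     parameters used by every token (attention, norms, ...)
    params_per_expert: parameters in one expert
    n_experts:         number of experts stored in the model
    top_k:             number of experts each token is routed to
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical figures, in millions of parameters.
total, active = moe_param_counts(shared_params=100, params_per_expert=50,
                                 n_experts=8, top_k=2)
print(total, active)  # 500 200
```

With these made-up figures the model stores 500M parameters but applies only 200M to any single token, which is the sense in which capacity can grow faster than per-token compute. Routing overhead and memory for the inactive experts are not counted here.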

Data Requirements

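Requires large-scale image datasets for training; the experiments reported in the abstract use the ImageNet benchmark. Video datasets would be needed only for video-generation applications.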

Deployment Constraints

Complexity of MoE routing can add overhead; careful implementation is needed.

Scalability

The core motivation is to enable efficient scaling of Diffusion Transformers.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years for adoption in major generative model frameworks.

Patent Potential

Moderate, for the specific routing mechanisms and MoE framework design for DiTs.
