📄 Abstract
Sparse Mixture of Experts (MoE) architectures have emerged as a promising
approach for scaling Transformer models. While initial works primarily
incorporated MoE into feed-forward network (FFN) layers, recent studies have
explored extending the MoE paradigm to attention layers to enhance model
performance. However, existing attention-based MoE layers require specialized
implementations and demonstrate suboptimal performance compared to their
FFN-based counterparts. In this paper, we aim to unify MoE designs in attention
and FFN layers by introducing a novel reformulation of the attention mechanism
that reveals an underlying FFN-like structure within attention modules. Our
proposed architecture, UMoE, achieves superior performance through
attention-based MoE layers while enabling efficient parameter sharing between
FFN and attention components.
Authors (3)
Yuanhang Yang
Chaozheng Wang
Jing Li
Key Contributions
Introduces UMoE, a novel architecture that unifies MoE designs in both attention and FFN layers of transformers. It reformulates the attention mechanism to reveal an FFN-like structure, enabling attention-based MoE layers that achieve superior performance and efficient parameter sharing between attention and FFN components.
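To make the contribution concrete, below is a minimal, illustrative sketch of the general idea described in the abstract: attention output is viewed as token mixing followed by an FFN-like projection, so a single bank of experts could be routed to from both an attention-based MoE layer and a standard FFN MoE layer. This is not the authors' implementation; all class names, the single-head/top-1-routing simplification, and the weight shapes are assumptions made for illustration.

```python
# Illustrative sketch only (not the UMoE reference code).
# Assumption: attention can be decomposed into token mixing (softmax(QK^T) X)
# followed by an FFN-like expert, allowing the expert bank to be shared with
# the FFN MoE layer. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExperts(nn.Module):
    """A bank of two-layer FFN experts that could, in principle, be reused by
    both an attention-based MoE layer and a standard FFN MoE layer."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_hidden) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_hidden, d_model) * 0.02)

    def forward(self, x: torch.Tensor, expert_idx: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); expert_idx: (tokens,) selected expert per token
        h = torch.einsum("td,tdh->th", x, self.w_in[expert_idx])
        return torch.einsum("th,thd->td", F.gelu(h), self.w_out[expert_idx])


class AttentionMoE(nn.Module):
    """Hypothetical attention-based MoE layer: tokens are first mixed with
    ordinary attention weights, then routed to the shared expert bank."""

    def __init__(self, d_model: int, experts: SharedExperts, n_experts: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq, d_model). Single head, no mask, top-1 routing for brevity.
        attn = F.softmax(self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5, dim=-1)
        mixed = attn @ x                        # token mixing ("attention" part)
        scores = F.softmax(self.router(mixed), dim=-1)
        gate, idx = scores.max(dim=-1)          # top-1 expert per mixed token
        return gate.unsqueeze(-1) * self.experts(mixed, idx)
```

In this sketch the same `SharedExperts` instance could also be called from an FFN MoE layer on unmixed token representations, which is one plausible way to realize the parameter sharing between attention and FFN components that the paper highlights.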
Business Value
Enables scaling Transformer models to greater capacity without a proportional increase in compute, since only a subset of expert parameters is activated per token; sharing experts between attention and FFN layers further reduces parameter count and inference cost while improving performance across tasks.