UMoE: Unifying Attention and FFN with Shared Experts

📄 Abstract

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism that reveals an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
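
The abstract does not spell out the reformulation here, but one standard way to expose an FFN-like structure in attention (a sketch consistent with this summary, not necessarily the paper's exact derivation) is to note that, for a single head, the per-token attention weights are scalars and can therefore be pulled outside the value/output projections:

$$
\mathrm{Attn}(x)_i \;=\; \sum_j A_{ij}\,(x_j W_V)\,W_O \;=\; \Big(\sum_j A_{ij}\, x_j\Big)\, W_V W_O,
\qquad
A_{ij} \;=\; \mathrm{softmax}_j\!\left(\frac{(x_i W_Q)(x_j W_K)^\top}{\sqrt{d_k}}\right).
$$

Read this way, attention is token mixing (the weighted sum $\sum_j A_{ij} x_j$) followed by a two-layer projection $W_V W_O$, which mirrors an FFN's up/down projections and is the natural place to slot in, and share, FFN-style experts.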
Authors (3)
Yuanhang Yang
Chaozheng Wang
Jing Li
Submitted
May 12, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Introduces UMoE, a novel architecture that unifies MoE designs in both attention and FFN layers of transformers. It reformulates the attention mechanism to reveal an FFN-like structure, enabling attention-based MoE layers that achieve superior performance and efficient parameter sharing between attention and FFN components.
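
To make the parameter-sharing idea concrete, below is a minimal, hypothetical PyTorch sketch of a shared-expert design: one pool of FFN experts with top-k routing, reused both on attention-mixed tokens and in the FFN position. All module names, hyperparameters, and the single-head token mixing are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the shared-expert idea; names and routing details are
# illustrative assumptions, not UMoE's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard two-layer FFN expert (up-projection, activation, down-projection)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)))


class SharedExpertMoE(nn.Module):
    """Top-k router over a pool of FFN experts; the same pool can be reused by
    the attention sub-layer (on attention-mixed tokens) and the FFN sub-layer."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.router(x)                         # (B, S, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)      # pick top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


class UnifiedBlock(nn.Module):
    """Toy block in which the attention and FFN sub-layers call the same expert pool."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.qk = nn.Linear(d_model, 2 * d_model, bias=False)
        self.moe = SharedExpertMoE(d_model, d_hidden, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k = self.qk(x).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        mixed = attn @ x                  # token mixing, as in the reformulation above
        x = x + self.moe(mixed)           # attention sub-layer: experts on mixed tokens
        x = x + self.moe(x)               # FFN sub-layer: same shared expert pool
        return x


block = UnifiedBlock(d_model=64, d_hidden=256, n_experts=4)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

The point of the sketch is that once attention is written as token mixing followed by an FFN-like transform, a single expert pool and router can serve both sub-layers, which is where the paper's parameter-sharing claim comes from.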

Business Value

Enables the development of larger and more capable AI models with improved computational efficiency, leading to faster training, reduced inference costs, and better performance across various AI tasks.