📄 Abstract
Sparse Mixture of Experts (MoE) architectures have emerged as a promising
approach for scaling Transformer models. While initial works primarily
incorporated MoE into feed-forward network (FFN) layers, recent studies have
explored extending the MoE paradigm to attention layers to enhance model
performance. However, existing attention-based MoE layers require specialized
implementations and demonstrate suboptimal performance compared to their
FFN-based counterparts. In this paper, we aim to unify MoE designs in attention
and FFN layers by introducing a novel reformulation of the attention mechanism
that reveals an underlying FFN-like structure within attention modules. Our
proposed architecture, UMoE, achieves superior performance through
attention-based MoE layers while enabling efficient parameter sharing between
FFN and attention components.
Authors (3)
Yuanhang Yang
Chaozheng Wang
Jing Li
Key Contributions
Introduces UMoE, a novel architecture that unifies MoE designs in both attention and FFN layers of transformers. It reformulates the attention mechanism to reveal an FFN-like structure, enabling attention-based MoE layers that achieve superior performance and efficient parameter sharing between attention and FFN components.
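To make the contribution concrete, below is a minimal, illustrative sketch of the general idea described in the abstract: attention output is viewed as token mixing followed by an FFN-like projection, so a single bank of experts could be routed to from both an attention-based MoE layer and a standard FFN MoE layer. This is not the authors' implementation; all class names, the single-head/top-1-routing simplification, and the weight shapes are assumptions made for illustration.

```python
# Illustrative sketch only (not the UMoE reference code).
# Assumption: attention can be decomposed into token mixing (softmax(QK^T) X)
# followed by an FFN-like expert, allowing the expert bank to be shared with
# the FFN MoE layer. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExperts(nn.Module):
    """A bank of two-layer FFN experts that could, in principle, be reused by
    both an attention-based MoE layer and a standard FFN MoE layer."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_hidden) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_hidden, d_model) * 0.02)

    def forward(self, x: torch.Tensor, expert_idx: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); expert_idx: (tokens,) selected expert per token
        h = torch.einsum("td,tdh->th", x, self.w_in[expert_idx])
        return torch.einsum("th,thd->td", F.gelu(h), self.w_out[expert_idx])


class AttentionMoE(nn.Module):
    """Hypothetical attention-based MoE layer: tokens are first mixed with
    ordinary attention weights, then routed to the shared expert bank."""

    def __init__(self, d_model: int, experts: SharedExperts, n_experts: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq, d_model). Single head, no mask, top-1 routing for brevity.
        attn = F.softmax(self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5, dim=-1)
        mixed = attn @ x                        # token mixing ("attention" part)
        scores = F.softmax(self.router(mixed), dim=-1)
        gate, idx = scores.max(dim=-1)          # top-1 expert per mixed token
        return gate.unsqueeze(-1) * self.experts(mixed, idx)
```

In this sketch the same `SharedExperts` instance could also be called from an FFN MoE layer on unmixed token representations, which is one plausible way to realize the parameter sharing between attention and FFN components that the paper highlights.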
Business Value
Enables scaling Transformer models to greater capacity without a proportional increase in compute, since only a subset of expert parameters is activated per token; sharing experts between attention and FFN layers further reduces parameter count and inference cost while improving performance across tasks.