📄 Abstract
Modern large language models leverage Mixture-of-Experts (MoE) architectures
for efficient scaling, but face a critical challenge: functionally similar
experts are often selected simultaneously, creating redundant computation and
limiting effective model capacity. Existing auxiliary balance loss methods
improve token distribution but fail to address the underlying expert diversity
problem. We introduce GatePro, a novel parameter-free method that directly
promotes expert selection diversity. GatePro identifies the most similar expert
pairs and introduces localized competition mechanisms, preventing redundant
expert co-activation while maintaining natural expert specialization. Our
comprehensive evaluation demonstrates GatePro's effectiveness across model
scales and benchmarks, and our analysis shows that it achieves greater expert
diversity: experts develop more distinct, complementary capabilities and avoid
functional redundancy. GatePro can be hot-swapped in during any training phase
without additional learnable parameters, offering a practical solution for
improving MoE effectiveness.
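The abstract's core mechanism, identifying the most similar expert pair and introducing localized competition between them, can be sketched as follows. This is a hypothetical illustration of the idea, not the authors' implementation: the functions `most_similar_pair` and `compete`, the use of cosine similarity over router weight rows, and the suppression-before-top-k step are all assumptions made for illustration.

```python
import numpy as np

def most_similar_pair(gate_weights):
    """Find the most similar pair of experts by cosine similarity of their
    router (gating) weight vectors. gate_weights: (num_experts, d_model)."""
    norm = gate_weights / np.linalg.norm(gate_weights, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return int(i), int(j)

def compete(router_logits, pair):
    """Localized competition: within the most similar pair, suppress the
    weaker expert's logit so the pair is not co-activated for this token."""
    i, j = pair
    out = router_logits.astype(float).copy()
    weaker = i if router_logits[i] < router_logits[j] else j
    out[weaker] = -np.inf  # removed from the subsequent softmax/top-k
    return out

# Toy example: experts 0 and 1 have nearly identical gating directions.
W = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
pair = most_similar_pair(W)            # -> (0, 1)
logits = np.array([2.0, 1.5, 0.5])
print(pair, compete(logits, pair))     # expert 1 is suppressed
```

The key design point, as described in the abstract, is that the competition is localized: only the single most redundant pair is affected, so other experts keep their natural routing and specialization.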
Authors (10)
Chen Zheng
Yuhang Cai
Deyi Liu
Jin Ma
Yiyuan Ma
Yuan Yang
+4 more
Submitted
October 15, 2025
Key Contributions
GatePro is a novel, parameter-free method that directly optimizes expert selection diversity in Mixture-of-Experts (MoE) models. By identifying similar experts and introducing competition between them, it prevents redundant computation and promotes specialization, thereby increasing effective model capacity. The approach improves on existing balance-loss methods by addressing the underlying expert diversity problem, leading to more efficient and capable MoE LLMs.
Business Value
Enables the development of more efficient and powerful large language models by optimizing the use of Mixture-of-Experts architectures. This can lead to reduced computational costs for training and inference, making advanced AI more accessible and deployable.