GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models

Abstract

Modern large language models leverage Mixture-of-Experts (MoE) architectures for efficient scaling, but face a critical challenge: functionally similar experts are often selected simultaneously, creating redundant computation and limiting effective model capacity. Existing auxiliary balance loss methods improve token distribution but fail to address the underlying expert diversity problem. We introduce GatePro, a novel parameter-free method that directly promotes expert selection diversity. GatePro identifies the most similar expert pairs and introduces localized competition mechanisms, preventing redundant expert co-activation while maintaining natural expert specialization. Our comprehensive evaluation demonstrates GatePro's effectiveness across model scales and benchmarks, and our analysis shows enhanced expert diversity: experts develop more distinct and complementary capabilities, avoiding functional redundancy. The approach can be hot-swapped in during any training phase without additional learnable parameters, offering a practical solution for improving MoE effectiveness.
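
The gating behavior described in the abstract can be approximated in a few lines. The sketch below is an illustrative reconstruction, not the authors' released code: the function name `competitive_top_k`, the use of cosine similarity over the batch's gating scores as the similarity measure, and the restriction to a single most-similar pair are assumptions made for clarity.

```python
# Illustrative sketch only (not the GatePro reference implementation).
import torch
import torch.nn.functional as F


def competitive_top_k(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Top-k expert selection after suppressing the most-similar expert pair.

    router_logits: [num_tokens, num_experts] raw gating scores.
    Returns a [num_tokens, num_experts] 0/1 mask of selected experts.
    """
    num_experts = router_logits.shape[1]

    # 1) Parameter-free similarity estimate: cosine similarity between the
    #    experts' gating-score columns over the current batch.
    cols = F.normalize(router_logits.t(), dim=-1)   # [num_experts, num_tokens]
    sim = cols @ cols.t()                           # [num_experts, num_experts]
    sim.fill_diagonal_(-float("inf"))               # ignore self-similarity

    # 2) Identify the most similar expert pair (i, j).
    i, j = divmod(torch.argmax(sim).item(), num_experts)

    # 3) Localized competition: per token, only the stronger of (i, j) stays
    #    eligible; the weaker one is masked out before routing.
    logits = router_logits.clone()
    i_wins = logits[:, i] >= logits[:, j]
    logits[i_wins, j] = float("-inf")
    logits[~i_wins, i] = float("-inf")

    # 4) Standard top-k expert selection on the adjusted logits.
    topk_idx = torch.topk(logits, k, dim=-1).indices
    return torch.zeros_like(router_logits).scatter_(1, topk_idx, 1.0)
```

Because the competition only removes one of the two near-duplicate candidates per token, each token still routes to exactly k experts; the intent, per the abstract, is to let the pair specialize apart rather than be co-activated redundantly.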
Authors (10)
Chen Zheng
Yuhang Cai
Deyi Liu
Jin Ma
Yiyuan Ma
Yuan Yang
+4 more
Submitted
October 15, 2025
arXiv Category
cs.CL

Key Contributions

GatePro is a novel, parameter-free method that directly optimizes expert selection diversity in Mixture-of-Experts (MoE) models. By identifying the most similar expert pairs and introducing localized competition between them, it prevents redundant computation and promotes specialization, thereby enhancing effective model capacity. This approach improves upon existing balance loss methods by addressing the underlying expert diversity problem, leading to more efficient and capable MoE LLMs.
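
A hedged illustration of the "parameter-free, hot-swappable" property: because the competition operates only on the router's existing gating scores, switching it on partway through training introduces no new learnable parameters. The `SimpleRouter` module and its `use_competition` flag below are hypothetical stand-ins for demonstration, not the paper's implementation.

```python
# Hypothetical demo of the parameter-free, hot-swappable property.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleRouter(nn.Module):
    """A plain linear gating network, as used by many MoE layers (illustrative)."""

    def __init__(self, hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.use_competition = False  # hot-swap flag; flipping it adds no parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                                # [tokens, num_experts]
        if self.use_competition:
            # Same idea as the earlier sketch: suppress the weaker member of the
            # most-similar expert pair for each token before routing.
            cols = F.normalize(logits.detach().t(), dim=-1)
            sim = cols @ cols.t()
            sim.fill_diagonal_(-float("inf"))
            i, j = divmod(torch.argmax(sim).item(), logits.shape[1])
            i_wins = logits[:, i] >= logits[:, j]
            logits = logits.clone()
            logits[i_wins, j] = float("-inf")
            logits[~i_wins, i] = float("-inf")
        return logits


router = SimpleRouter(hidden=32, num_experts=8)
n_before = sum(p.numel() for p in router.parameters())
router.use_competition = True                      # "hot-swap" the mechanism on
n_after = sum(p.numel() for p in router.parameters())
assert n_before == n_after                         # parameter-free by construction
_ = router(torch.randn(4, 32))                     # still routes a dummy batch
```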

Business Value

Enables the development of more efficient and powerful large language models by optimizing the use of Mixture-of-Experts architectures. This can lead to reduced computational costs for training and inference, making advanced AI more accessible and deployable.