
Mixture of Experts in Image Classification: What's the Sweet Spot?

Abstract

Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across domains. However, their application to image classification remains limited, often requiring billion-scale datasets to be competitive. In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We find that moderate parameter activation per sample provides the best trade-off between performance and efficiency; as the number of activated parameters increases, the benefits of MoE diminish. Our analysis yields several practical insights for vision MoE design. First, MoE layers most effectively strengthen tiny and mid-sized models, while gains taper off for large-capacity networks and do not redefine state-of-the-art ImageNet performance. Second, a Last-2 placement heuristic offers the most robust cross-architecture choice, with Every-2 slightly better for Vision Transformers (ViT), and both remain effective as data and model scale increase. Third, larger datasets (e.g., ImageNet-21k) allow more experts, up to 16 for ConvNeXt, to be utilized effectively without changing placement, as increased data reduces overfitting and promotes broader expert specialization. Finally, a simple linear router performs best, suggesting that additional routing complexity yields no consistent benefit.
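
To make the "simple linear router" and "moderate parameter activation" concrete, here is a minimal PyTorch-style sketch of an MoE MLP block with a plain linear top-k router. The expert count, hidden width, and top-k value are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an MoE MLP block with a simple linear (top-k) router.
# Hyperparameters below are illustrative, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEMLP(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Plain linear router: one logit per expert for each token.
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> flatten tokens so each is routed independently
        b, t, d = x.shape
        flat = x.reshape(-1, d)
        logits = self.router(flat)                      # (b*t, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize the selected logits
        out = torch.zeros_like(flat)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(flat[mask])
        return out.reshape(b, t, d)
```

With top-k routing, only k of the expert MLPs run per token, so the activated parameter count per sample stays moderate even as the total parameter count grows with the number of experts.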
Authors (4)
Mathurin Videau
Alessandro Leite
Marc Schoenauer
Olivier Teytaud
Submitted
November 27, 2024
arXiv Category
cs.CV
arXiv PDF

Key Contributions

This work systematically analyzes the integration of MoE layers into image classification architectures, finding that moderate parameter activation offers the best performance-efficiency trade-off. It reveals that MoE layers are most effective for small to medium models and that a 'Last-2' placement heuristic is robust across architectures.
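
To illustrate the placement heuristics, the following hypothetical helpers apply the 'Last-2' (and, for comparison, 'Every-2') placement to a list of transformer-style blocks. The `blocks` container, the `.mlp` attribute name, and the `MoEMLP` class from the sketch above are assumptions for illustration, not the authors' code.

```python
# Hypothetical placement helpers, assuming each block exposes a dense `.mlp`
# submodule that can be swapped for the MoEMLP sketched earlier.
import torch.nn as nn

def apply_last2_moe(blocks: nn.ModuleList, dim: int, hidden_dim: int,
                    num_experts: int = 8, top_k: int = 2) -> nn.ModuleList:
    # 'Last-2': replace the MLP only in the final two blocks.
    for block in list(blocks)[-2:]:
        block.mlp = MoEMLP(dim, hidden_dim, num_experts=num_experts, top_k=top_k)
    return blocks

def apply_every2_moe(blocks: nn.ModuleList, dim: int, hidden_dim: int,
                     num_experts: int = 8, top_k: int = 2) -> nn.ModuleList:
    # 'Every-2': replace the MLP in every second block.
    for block in list(blocks)[1::2]:
        block.mlp = MoEMLP(dim, hidden_dim, num_experts=num_experts, top_k=top_k)
    return blocks
```

Per the abstract, Last-2 is the more robust default across architectures, while Every-2 can be slightly better for ViT.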

Business Value

Enables more efficient training and deployment of vision models, making advanced capabilities accessible for a wider range of applications and hardware.