Abstract
Mixture-of-Experts (MoE) models have shown promising potential for
parameter-efficient scaling across domains. However, their application to image
classification remains limited, often requiring billion-scale datasets to be
competitive. In this work, we explore the integration of MoE layers into image
classification architectures using open datasets. We conduct a systematic
analysis across different MoE configurations and model scales. We find that
moderate parameter activation per sample provides the best trade-off between
performance and efficiency. However, as the number of activated parameters
increases, the benefits of MoE diminish. Our analysis yields several practical
insights for vision MoE design. First, MoE layers most effectively strengthen
tiny and mid-sized models, while gains taper off for large-capacity networks
and do not redefine state-of-the-art ImageNet performance. Second, a Last-2
placement heuristic offers the most robust cross-architecture choice, with
Every-2 slightly better for the Vision Transformer (ViT), and both remaining
effective as data and model scale increase. Third, larger datasets (e.g.,
ImageNet-21k) allow more experts (up to 16 for ConvNeXt) to be used
effectively without changing placement, as increased data reduces overfitting
and promotes broader expert specialization. Finally, a simple linear router
performs best, suggesting that additional routing complexity yields no
consistent benefit.
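To make the placement heuristic and routing choice concrete, below is a minimal PyTorch sketch, not the authors' implementation: an MoE feed-forward layer with a simple linear top-k router, plus a helper that applies the Last-2 placement by swapping the dense MLP in the final two blocks. All names (MoEFeedForward, apply_last2_placement, block.mlp, num_experts, top_k) are illustrative assumptions rather than identifiers from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Replaces a dense MLP with num_experts expert MLPs and a linear router.
    Illustrative sketch only; hyperparameters are placeholders."""

    def __init__(self, dim, hidden_dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)  # simple linear router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, tokens, dim)
        logits = self.router(x)                 # (batch, tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the selected experts
        out = torch.zeros_like(x)
        # Dispatch each token to its top-k experts and mix the expert outputs.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

def apply_last2_placement(blocks, dim, hidden_dim, num_experts=8, top_k=2):
    """'Last-2' heuristic: replace the dense MLP in the final two blocks with MoE layers.
    Assumes each block exposes its feed-forward sub-module as `block.mlp` (hypothetical)."""
    for block in blocks[-2:]:
        block.mlp = MoEFeedForward(dim, hidden_dim, num_experts, top_k)
    return blocks

Only the tokens' selected experts run, so activated parameters per sample grow with top_k rather than with num_experts, which is the trade-off the abstract refers to when it recommends moderate parameter activation.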
Authors (4)
Mathurin Videau
Alessandro Leite
Marc Schoenauer
Olivier Teytaud
Submitted
November 27, 2024
Key Contributions
This work systematically analyzes the integration of MoE layers into image classification architectures, finding that moderate parameter activation offers the best performance-efficiency trade-off. It reveals that MoE layers are most effective for small to medium models and that a 'Last-2' placement heuristic is robust across architectures.
Business Value
Enables more efficient training and deployment of vision models, making advanced capabilities accessible for a wider range of applications and hardware.