Abstract
The sparsely-activated Mixture-of-Experts (MoE) architecture has increasingly
been adopted to further scale large language models (LLMs). However, frequent
failures still pose significant challenges as training scales. The cost of even
a single failure is significant, as all GPUs must sit idle until the failure is
resolved, and considerable training progress can be lost because training has
to restart from a checkpoint. This problem is exacerbated by the growing use of
spot instances on public clouds for model training, which, despite offering
substantial cost savings, introduce frequent preemptions: essentially failures
that occur regularly throughout the training process. Existing solutions for
efficient fault-tolerant training either lack elasticity or build resiliency
into pipeline parallelism, an approach that cannot be applied to MoE models
because of the expert parallelism strategy adopted by the MoE architecture.
We present Lazarus, a system for resilient and elastic training of MoE
models. Lazarus adaptively allocates expert replicas to address the inherent
imbalance in expert workload and thereby speed up training, and it uses a
provably optimal expert placement algorithm to maximize the probability of
recovery upon failures. Through adaptive expert placement and a flexible token
dispatcher, Lazarus can also fully utilize all available nodes after failures,
leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE
training systems by up to 5.7x under frequent node failures and 3.4x on a real
spot instance trace.
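To make the idea of adaptive expert replica allocation concrete, below is a minimal, hypothetical sketch (not the paper's actual, provably optimal algorithm): it simply assigns each expert at least one replica and distributes the remaining GPU slots roughly in proportion to observed per-expert token loads. The function name `allocate_expert_replicas` and its inputs (`expert_loads`, `total_slots`) are illustrative assumptions, not names from Lazarus.

```python
# Hypothetical sketch: load-proportional expert replica allocation.
# Assumes per-expert token counts from recent routing statistics and a
# fixed number of GPU slots available for expert replicas.
from typing import Dict


def allocate_expert_replicas(expert_loads: Dict[int, int], total_slots: int) -> Dict[int, int]:
    """Return a replica count per expert: at least 1 each, with spare slots
    distributed by load using a largest-remainder rounding scheme."""
    num_experts = len(expert_loads)
    assert total_slots >= num_experts, "need at least one slot per expert"

    total_load = sum(expert_loads.values()) or 1
    # Start with one replica per expert, then distribute the remaining slots.
    replicas = {e: 1 for e in expert_loads}
    spare = total_slots - num_experts

    # Ideal (fractional) share of the spare slots for each expert.
    shares = {e: spare * load / total_load for e, load in expert_loads.items()}
    for e, share in shares.items():
        replicas[e] += int(share)

    # Hand out leftover slots to experts with the largest fractional parts.
    leftover = total_slots - sum(replicas.values())
    by_fraction = sorted(shares, key=lambda e: shares[e] - int(shares[e]), reverse=True)
    for e in by_fraction[:leftover]:
        replicas[e] += 1
    return replicas


if __name__ == "__main__":
    # Skewed token counts per expert, as routing imbalance typically produces.
    loads = {0: 9000, 1: 3000, 2: 2000, 3: 1000}
    print(allocate_expert_replicas(loads, total_slots=8))
    # -> {0: 3, 1: 2, 2: 2, 3: 1}
```

A real system would additionally decide *where* each replica lives so that, after a node failure or preemption, every expert still has at least one surviving replica; that placement objective is what Lazarus's provably optimal algorithm addresses, and it is not captured by this proportional-allocation sketch.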
Authors (13)
Yongji Wu
Wenjie Qu
Xueshen Liu
Tianyang Tao
Yifan Qiao
Zhuang Wang
and 7 more
Key Contributions
Lazarus is a system designed for resilient and elastic training of Mixture-of-Experts (MoE) models, addressing the challenges of frequent failures and preemptions, especially when using spot instances. It provides elasticity and fault tolerance specifically tailored for MoE architectures, overcoming limitations of existing solutions that rely on pipeline parallelism.
Business Value
Dramatically reduces the cost and improves the reliability of training massive MoE models by enabling the effective use of cheaper spot instances and minimizing downtime, making large-scale AI development more economically viable.