
Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models

📄 Abstract

Sparsely-activated Mixture-of-Experts (MoE) architectures are increasingly adopted to further scale large language models (LLMs). However, frequent failures still pose significant challenges as training scales. Even a single failure is costly: all GPUs sit idle until the failure is resolved, and considerable training progress can be lost because training must restart from a checkpoint. The problem is exacerbated by the growing use of spot instances on public clouds, which offer substantial cost savings but introduce frequent preemptions, essentially failures that occur regularly throughout training. Existing solutions for efficient fault-tolerant training either lack elasticity or build resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speed up training, while a provably optimal expert placement algorithm maximizes the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and by 3.4x on a real spot instance trace.
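
To make the replica-allocation idea in the abstract concrete, here is a minimal sketch assuming a simple largest-remainder rounding scheme: spare GPU slots are assigned to experts roughly in proportion to their observed token load. The function `allocate_expert_replicas`, its arguments, and the rounding rule are illustrative assumptions, not Lazarus's published algorithm, which additionally optimizes placement for recovery probability.

```python
from typing import Dict

def allocate_expert_replicas(
    token_counts: Dict[int, int],  # expert_id -> tokens routed in a recent window
    total_gpu_slots: int,          # GPU slots available for hosting expert replicas
) -> Dict[int, int]:
    """Assign replica counts roughly proportional to observed expert load.

    Every expert keeps at least one replica; leftover slots go to the experts
    with the largest fractional remainders (largest-remainder rounding).
    """
    num_experts = len(token_counts)
    assert total_gpu_slots >= num_experts, "need at least one slot per expert"

    total_tokens = sum(token_counts.values()) or 1
    # Ideal (fractional) share of slots for each expert, by token load.
    ideal = {e: total_gpu_slots * c / total_tokens for e, c in token_counts.items()}
    # Start from the floor of the ideal share, but never below one replica.
    replicas = {e: max(1, int(share)) for e, share in ideal.items()}

    leftover = total_gpu_slots - sum(replicas.values())
    if leftover > 0:
        # Hand out remaining slots to the experts that were rounded down the most.
        by_remainder = sorted(ideal, key=lambda e: ideal[e] - int(ideal[e]), reverse=True)
        for e in by_remainder[:leftover]:
            replicas[e] += 1
    while leftover < 0:
        # Over-subscribed after enforcing the one-replica floor:
        # trim the least-loaded expert that still has a spare replica.
        victim = min((e for e in replicas if replicas[e] > 1), key=lambda e: token_counts[e])
        replicas[victim] -= 1
        leftover += 1
    return replicas

# Example: four experts with skewed load sharing 8 GPU slots.
print(allocate_expert_replicas({0: 700, 1: 150, 2: 100, 3: 50}, total_gpu_slots=8))
# -> {0: 5, 1: 1, 2: 1, 3: 1}
```

The point of the sketch is only the load-proportional assignment; deciding which physical nodes host each replica so that recovery remains possible after failures is the harder problem the paper's placement algorithm addresses.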
Authors (13)
Yongji Wu
Wenjie Qu
Xueshen Liu
Tianyang Tao
Yifan Qiao
Zhuang Wang
+7 more
Submitted
July 5, 2024
arXiv Category
cs.DC
arXiv PDF

Key Contributions

Lazarus is a system for resilient and elastic training of Mixture-of-Experts (MoE) models, addressing frequent failures and preemptions, especially when training on spot instances. It adaptively allocates expert replicas to counter the inherent imbalance in expert workload, uses a provably optimal expert placement algorithm to maximize the probability of recovery upon failures, and employs a flexible token dispatcher so that all available nodes remain utilized after a failure (see the sketch below). This provides elasticity and fault tolerance tailored to expert parallelism, overcoming the limitations of existing solutions that build resiliency into pipeline parallelism.
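
As a rough illustration of what a flexible token dispatcher could look like, the sketch below reroutes each token to a surviving replica of its assigned expert after a failure. The function `dispatch_tokens`, its data structures, and the random replica choice are hypothetical stand-ins for Lazarus's actual dispatcher and its placement-aware scheduling.

```python
import random
from typing import Dict, List, Set

def dispatch_tokens(
    expert_assignments: List[int],       # gating decision: expert id for each token
    replica_map: Dict[int, List[int]],   # expert id -> GPU ranks holding a replica
    alive_ranks: Set[int],               # ranks that survived the latest failure/preemption
) -> Dict[int, List[int]]:
    """Route each token to a live replica of its assigned expert.

    Returns a mapping from GPU rank to the indices of tokens it should process.
    Tokens whose expert has no surviving replica are skipped here; a real system
    would instead recover that expert from a checkpoint onto a live node.
    """
    per_rank: Dict[int, List[int]] = {}
    for token_idx, expert_id in enumerate(expert_assignments):
        live_replicas = [r for r in replica_map.get(expert_id, []) if r in alive_ranks]
        if not live_replicas:
            continue  # no surviving replica for this expert
        # Spread tokens across live replicas; a real dispatcher would balance by load.
        target = random.choice(live_replicas)
        per_rank.setdefault(target, []).append(token_idx)
    return per_rank

# Example: expert 1's only replica lived on rank 3, which was just preempted.
replica_map = {0: [0, 2], 1: [3], 2: [1]}
print(dispatch_tokens([0, 1, 2, 0, 2], replica_map, alive_ranks={0, 1, 2}))
```

Because dispatch consults the current replica map rather than a fixed expert-to-rank assignment, surviving nodes keep receiving work after a preemption instead of idling, which is the behavior the paper describes as leaving no GPU idle.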

Business Value

Dramatically reduces the cost and improves the reliability of training massive MoE models by enabling the effective use of cheaper spot instances and minimizing downtime, making large-scale AI development more economically viable.