Abstract
The sparsely-activated Mixture-of-Experts (MoE) architecture has increasingly
been adopted to further scale large language models (LLMs). However, frequent
failures still pose significant challenges as training scales. The cost of even
a single failure is significant, as all GPUs must sit idle until the failure is
resolved, and considerable training progress can be lost because training has
to restart from a checkpoint. This problem is exacerbated by the growing use of
spot instances on public clouds for model training, which, despite offering
substantial cost savings, introduce frequent preemptions: essentially failures
that occur regularly throughout the training process. Existing solutions for
efficient fault-tolerant training either lack elasticity or build resiliency
into pipeline parallelism, an approach that cannot be applied to MoE models
because of the expert parallelism strategy adopted by the MoE architecture.
We present Lazarus, a system for resilient and elastic training of MoE
models. Lazarus adaptively allocates expert replicas to address the inherent
imbalance in expert workload and thereby speed up training, and it uses a
provably optimal expert placement algorithm to maximize the probability of
recovery upon failures. Through adaptive expert placement and a flexible token
dispatcher, Lazarus can also fully utilize all available nodes after failures,
leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE
training systems by up to 5.7x under frequent node failures and 3.4x on a real
spot instance trace.
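To make the idea of adaptive expert replica allocation concrete, below is a minimal, hypothetical sketch (not the paper's actual, provably optimal algorithm): it simply assigns each expert at least one replica and distributes the remaining GPU slots roughly in proportion to observed per-expert token loads. The function name `allocate_expert_replicas` and its inputs (`expert_loads`, `total_slots`) are illustrative assumptions, not names from Lazarus.

```python
# Hypothetical sketch: load-proportional expert replica allocation.
# Assumes per-expert token counts from recent routing statistics and a
# fixed number of GPU slots available for expert replicas.
from typing import Dict


def allocate_expert_replicas(expert_loads: Dict[int, int], total_slots: int) -> Dict[int, int]:
    """Return a replica count per expert: at least 1 each, with spare slots
    distributed by load using a largest-remainder rounding scheme."""
    num_experts = len(expert_loads)
    assert total_slots >= num_experts, "need at least one slot per expert"

    total_load = sum(expert_loads.values()) or 1
    # Start with one replica per expert, then distribute the remaining slots.
    replicas = {e: 1 for e in expert_loads}
    spare = total_slots - num_experts

    # Ideal (fractional) share of the spare slots for each expert.
    shares = {e: spare * load / total_load for e, load in expert_loads.items()}
    for e, share in shares.items():
        replicas[e] += int(share)

    # Hand out leftover slots to experts with the largest fractional parts.
    leftover = total_slots - sum(replicas.values())
    by_fraction = sorted(shares, key=lambda e: shares[e] - int(shares[e]), reverse=True)
    for e in by_fraction[:leftover]:
        replicas[e] += 1
    return replicas


if __name__ == "__main__":
    # Skewed token counts per expert, as routing imbalance typically produces.
    loads = {0: 9000, 1: 3000, 2: 2000, 3: 1000}
    print(allocate_expert_replicas(loads, total_slots=8))
    # -> {0: 3, 1: 2, 2: 2, 3: 1}
```

A real system would additionally decide *where* each replica lives so that, after a node failure or preemption, every expert still has at least one surviving replica; that placement objective is what Lazarus's provably optimal algorithm addresses, and it is not captured by this proportional-allocation sketch.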
Authors (13)
Yongji Wu
Wenjie Qu
Xueshen Liu
Tianyang Tao
Yifan Qiao
Zhuang Wang
and 7 more
Key Contributions
Lazarus is a system designed for resilient and elastic training of Mixture-of-Experts (MoE) models, addressing the challenges of frequent failures and preemptions, especially when using spot instances. It provides elasticity and fault tolerance specifically tailored for MoE architectures, overcoming limitations of existing solutions that rely on pipeline parallelism.
Business Value
Dramatically reduces the cost and improves the reliability of training massive MoE models by enabling the effective use of cheaper spot instances and minimizing downtime, making large-scale AI development more economically viable.