
FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees

📄 Abstract

Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations -- dependent parallelization and graph pruning -- significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light loads, preserving over 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
Authors (12)
Gabriele Oliaro
Xupeng Miao
Xinhao Cheng
Vineeth Kada
Mengdi Wu
Ruohan Gao
+6 more
Submitted
February 29, 2024
arXiv Category
cs.DC
arXiv PDF

Key Contributions

FlexLLM is the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. Its static compilation optimizations (dependent parallelization and graph pruning) shrink activation memory, cutting end-to-end GPU memory usage by up to 80%, while a token-level finetuning mechanism with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting latency SLOs while maximizing GPU utilization (see the sketch below).
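
The scheduling idea can be illustrated with a minimal Python sketch of a hybrid token scheduler. This is not FlexLLM's actual API; the token budget, queue names, and greedy fill policy are simplifying assumptions. Each co-serving iteration fills a fixed token budget with SLO-critical inference tokens first, then backfills the remaining slots with finetuning tokens.

```python
# Hypothetical sketch of token-level co-serving (illustrative names only;
# TOKEN_BUDGET, inference_queue, and finetune_queue are not FlexLLM APIs).
from collections import deque

TOKEN_BUDGET = 2048          # tokens processed per co-serving iteration
inference_queue = deque()    # pending inference (prefill/decode) tokens
finetune_queue = deque()     # pending PEFT finetuning tokens

def schedule_iteration():
    """Build one fused batch: inference tokens first to protect latency SLOs,
    then backfill leftover budget with finetuning tokens."""
    batch = []
    # 1. Admit inference tokens up to the budget (SLO-critical work).
    while inference_queue and len(batch) < TOKEN_BUDGET:
        batch.append(("inference", inference_queue.popleft()))
    # 2. Backfill remaining slots with finetuning tokens so the GPU
    #    stays busy instead of idling between inference requests.
    while finetune_queue and len(batch) < TOKEN_BUDGET:
        batch.append(("finetune", finetune_queue.popleft()))
    return batch

# Example: under heavy inference load, most of the budget serves inference,
# but leftover slots still make finetuning progress.
inference_queue.extend(range(1500))   # 1500 pending inference tokens
finetune_queue.extend(range(5000))    # large finetuning backlog
batch = schedule_iteration()
print(sum(kind == "finetune" for kind, _ in batch))  # -> 548 finetuning tokens fused
```

Under heavy inference load few finetuning tokens are admitted per iteration, while under light load most of the budget goes to finetuning, which is consistent with the throughput ranges reported in the abstract.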

Business Value

Enabling concurrent inference and finetuning on shared hardware significantly reduces the cost and improves the efficiency of deploying and adapting large language models, making LLM services more accessible.