Abstract
Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters, wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations (dependent parallelization and graph pruning) significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light loads, preserving over 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
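
The runtime mechanism described in the abstract, interleaving inference and finetuning tokens within each co-serving iteration under a shared token budget, can be pictured with the toy scheduler below. This is a minimal sketch assuming a simple admit-inference-first, backfill-with-finetuning policy; the names (`Request`, `schedule_iteration`) and policy details are hypothetical and are not FlexLLM's actual scheduler.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Request:
    request_id: int
    pending_tokens: int  # tokens still to be processed for this request

def schedule_iteration(inference_reqs: List[Request],
                       finetune_reqs: List[Request],
                       token_budget: int) -> List[Tuple[int, str, int]]:
    """Assign (request_id, kind, n_tokens) slots for one co-serving iteration."""
    batch, remaining = [], token_budget

    # 1. Admit inference tokens first so latency-critical requests meet their SLO.
    for req in inference_reqs:
        if remaining == 0:
            break
        n = min(req.pending_tokens, remaining)
        batch.append((req.request_id, "inference", n))
        remaining -= n

    # 2. Backfill leftover capacity with finetuning tokens to keep the GPU busy.
    for req in finetune_reqs:
        if remaining == 0:
            break
        n = min(req.pending_tokens, remaining)
        batch.append((req.request_id, "finetune", n))
        remaining -= n

    return batch

# Example: a 48-token budget with light inference load leaves most slots to finetuning.
if __name__ == "__main__":
    inf = [Request(0, 8), Request(1, 4)]
    ft = [Request(100, 64)]
    print(schedule_iteration(inf, ft, token_budget=48))
    # -> [(0, 'inference', 8), (1, 'inference', 4), (100, 'finetune', 36)]
```

Under such a policy, heavier inference traffic consumes more of the per-iteration budget and finetuning progress degrades gracefully rather than stopping, consistent with the abstract's report of over 76% of peak finetuning progress being preserved at peak demand.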
Authors (12)
Gabriele Oliaro
Xupeng Miao
Xinhao Cheng
Vineeth Kada
Mengdi Wu
Ruohan Gao
+6 more
Submitted
February 29, 2024
Key Contributions
FlexLLM is the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. It employs static compilation optimizations (dependent parallelization and graph pruning) to shrink activation memory, cutting end-to-end GPU memory usage by up to 80%, and pairs a token-level finetuning mechanism with a hybrid token scheduler that dynamically interleaves inference and training tokens, meeting latency SLOs while maximizing utilization.
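
One way to picture the token-level fusion mentioned above: with a LoRA-style adapter (assumed here as a representative PEFT method; the paper targets PEFT finetuning generally), inference and finetuning tokens can be packed into a single batch that shares the frozen base-weight computation, with the low-rank update applied only to the finetuning tokens. The sketch below is an illustration under those assumptions, not FlexLLM's actual kernels.

```python
import numpy as np

# Sketch of token-level fusion with a LoRA-style adapter (assumed PEFT method):
# inference and finetuning tokens share the base-weight GEMM in one batch; the
# low-rank update B @ A is applied only to the finetuning tokens.

d_model, rank = 16, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_model))        # frozen base weight
A = rng.standard_normal((rank, d_model)) * 0.01    # LoRA down-projection (trainable)
B = rng.standard_normal((d_model, rank)) * 0.01    # LoRA up-projection (trainable)

def fused_forward(tokens: np.ndarray, is_finetune: np.ndarray) -> np.ndarray:
    """tokens: (n_tokens, d_model); is_finetune: (n_tokens,) boolean mask."""
    out = tokens @ W.T                    # shared GEMM over all tokens
    ft = tokens[is_finetune]              # only finetuning tokens see the adapter
    out[is_finetune] += ft @ A.T @ B.T    # low-rank LoRA update
    return out

# Example: 5 inference tokens and 3 finetuning tokens fused into one batch.
tokens = rng.standard_normal((8, d_model))
mask = np.array([False] * 5 + [True] * 3)
print(fused_forward(tokens, mask).shape)  # (8, 16)
```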
Business Value
Significantly reduces the cost and improves the efficiency of deploying and adapting large language models by enabling concurrent inference and finetuning on shared hardware, making LLM services more accessible and cost-effective.