📄 Abstract
Large language models (LLMs) have facilitated a wide range of applications
with distinct service-level objectives (SLOs), from latency-sensitive online
tasks like interactive chatbots to throughput-oriented offline workloads like
data synthesis. The existing deployment model, which dedicates machines to each
workload, simplifies SLO management but often leads to poor resource
utilization. This paper introduces HyGen, an interference-aware LLM serving
system that enables efficient co-location of online and offline workloads while
preserving SLOs. HyGen incorporates two key innovations: (1) performance
control mechanisms, including a latency predictor to estimate batch execution
time and an SLO-aware profiler to quantify latency interference, and (2)
SLO-aware offline scheduling policies that maximize serving throughput and
prevent starvation. Our evaluation on production workloads shows that HyGen
achieves 3.9-5.8x throughput gains over online and hybrid serving
baselines, while ensuring latency SLOs. The code of HyGen is publicly available
at https://github.com/UIUC-MLSys/HyGen.
Authors (3)
Ting Sun
Penghan Wang
Fan Lai
Submitted
January 15, 2025
Key Contributions
Introduces HyGen, an interference-aware LLM serving system that efficiently co-locates online (latency-sensitive) and offline (throughput-oriented) workloads. It uses a latency predictor and SLO-aware profiler to manage interference and employs SLO-aware scheduling to maximize throughput.
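The co-location idea above can be illustrated with a minimal sketch: a scheduler that always admits online requests, then fills the remaining latency headroom with offline requests, using a latency predictor to check that the batch still meets the SLO. All names, the linear predictor, and its coefficients are illustrative assumptions, not HyGen's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: int
    online: bool  # True = latency-sensitive, False = offline/best-effort

# Hypothetical linear latency predictor: batch execution time grows with
# the total tokens in the batch (coefficients are made up for illustration).
def predict_batch_latency_ms(batch):
    total_tokens = sum(r.tokens for r in batch)
    return 5.0 + 0.02 * total_tokens

SLO_MS = 50.0  # example per-batch latency budget for online requests

def schedule(online_queue, offline_queue):
    """Admit all online requests, then greedily add offline requests
    only while the predicted batch latency stays within the SLO."""
    batch = list(online_queue)
    for req in offline_queue:
        if predict_batch_latency_ms(batch + [req]) <= SLO_MS:
            batch.append(req)
    return batch
```

In this toy setup, an online request of 200 tokens leaves headroom for a 1000-token offline request (predicted 29 ms) but not an additional 1500-token one (predicted 59 ms), so only the first offline request is co-located.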
Business Value
Significantly reduces operational costs for deploying LLMs by improving resource utilization and enabling mixed workload serving, making LLM applications more economically viable.