📄 Abstract
Large language models (LLMs) have facilitated a wide range of applications
with distinct service-level objectives (SLOs), from latency-sensitive online
tasks like interactive chatbots to throughput-oriented offline workloads like
data synthesis. The existing deployment model, which dedicates machines to each
workload, simplifies SLO management but often leads to poor resource
utilization. This paper introduces HyGen, an interference-aware LLM serving
system that enables efficient co-location of online and offline workloads while
preserving SLOs. HyGen incorporates two key innovations: (1) performance
control mechanisms, including a latency predictor to estimate batch execution
time and an SLO-aware profiler to quantify latency interference, and (2)
SLO-aware offline scheduling policies that maximize serving throughput and
prevent starvation. Our evaluation on production workloads shows that HyGen
achieves 3.9-5.8x throughput gains over online and hybrid serving
baselines, while ensuring latency SLOs. The code of HyGen is publicly available
at https://github.com/UIUC-MLSys/HyGen.
Authors (3)
Ting Sun
Penghan Wang
Fan Lai
Submitted
January 15, 2025
Key Contributions
Introduces HyGen, an interference-aware LLM serving system that efficiently co-locates online (latency-sensitive) and offline (throughput-oriented) workloads. It uses a latency predictor and SLO-aware profiler to manage interference and employs SLO-aware scheduling to maximize throughput.
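The co-location idea above can be illustrated with a minimal sketch: a scheduler that always admits online requests, then fills the remaining latency headroom with offline requests, using a latency predictor to check that the batch still meets the SLO. All names, the linear predictor, and its coefficients are illustrative assumptions, not HyGen's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: int
    online: bool  # True = latency-sensitive, False = offline/best-effort

# Hypothetical linear latency predictor: batch execution time grows with
# the total tokens in the batch (coefficients are made up for illustration).
def predict_batch_latency_ms(batch):
    total_tokens = sum(r.tokens for r in batch)
    return 5.0 + 0.02 * total_tokens

SLO_MS = 50.0  # example per-batch latency budget for online requests

def schedule(online_queue, offline_queue):
    """Admit all online requests, then greedily add offline requests
    only while the predicted batch latency stays within the SLO."""
    batch = list(online_queue)
    for req in offline_queue:
        if predict_batch_latency_ms(batch + [req]) <= SLO_MS:
            batch.append(req)
    return batch
```

In this toy setup, an online request of 200 tokens leaves headroom for a 1000-token offline request (predicted 29 ms) but not an additional 1500-token one (predicted 59 ms), so only the first offline request is co-located.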
Business Value
Significantly reduces operational costs for deploying LLMs by improving resource utilization and enabling mixed workload serving, making LLM applications more economically viable.