Abstract
Efficient scheduling of LLM inference tasks is essential for achieving low
latency and high throughput, particularly with the growing use of
reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve
(FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks
delay shorter ones queued behind them. In this paper, we introduce PARS, a
prompt-aware LLM task scheduler that improves serving efficiency by
approximating shortest-job-first (SJF) scheduling through pairwise ranking with
margin ranking loss. PARS focuses on impactful scheduling decisions and is
seamlessly integrated into the state-of-the-art LLM serving system vLLM. It
effectively predicts response-length-based task ordering, reducing latency with
minimal overhead. Extensive experiments across multiple LLMs and real-world
inference datasets show that PARS significantly improves performance, including
for reasoning workloads. Furthermore, our cross-model evaluations demonstrate
that the design generalizes well, enabling effective scheduling even when
predictors are trained on different LLMs.
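The core idea is learning a relative ordering of prompts by expected response length, rather than regressing exact lengths. As a rough illustration (not the paper's implementation), the PyTorch sketch below trains a pairwise scorer with margin ranking loss; the embedding dimension, network shape, margin value, and toy data are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical scorer over fixed-size prompt embeddings (768-dim is an assumption).
model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))
loss_fn = nn.MarginRankingLoss(margin=0.5)  # margin value chosen for illustration
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy pair batch: embeddings for prompts a and b with known response lengths.
emb_a, emb_b = torch.randn(32, 768), torch.randn(32, 768)
len_a = torch.randint(1, 2048, (32,)).float()
len_b = torch.randint(1, 2048, (32,)).float()

s_a = model(emb_a).squeeze(-1)  # higher score = predicted-longer response
s_b = model(emb_b).squeeze(-1)
# target = +1 where a's true response is longer than b's, else -1
target = (len_a > len_b).float() * 2 - 1
loss = loss_fn(s_a, s_b, target)  # hinge on the score gap vs. the margin
opt.zero_grad()
loss.backward()
opt.step()
```

At serving time, sorting waiting requests by the learned score in ascending order approximates shortest-job-first without requiring calibrated length estimates.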
Key Contributions
Introduces PARS, a prompt-aware LLM task scheduler that approximates shortest-job-first scheduling using pairwise learning-to-rank with margin ranking loss. Integrated into vLLM, PARS predicts response-length-based task ordering to reduce latency with minimal overhead. It significantly improves performance across multiple LLMs and real-world inference datasets, especially for reasoning workloads.
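To make the scheduling side concrete, the following toy sketch orders a waiting queue by predicted response-length score, falling back to FCFS order on ties. This is a stand-in illustration under assumed interfaces, not vLLM's actual scheduler API or PARS's integration.

```python
import heapq

class ShortestPredictedFirstQueue:
    """Toy waiting queue that pops the request with the smallest predicted
    response-length score first (approximating SJF); ties fall back to FCFS."""

    def __init__(self, predict):
        self.predict = predict   # assumed callable: prompt -> length score
        self._heap = []
        self._arrival = 0        # monotonically increasing FCFS tie-breaker

    def push(self, prompt):
        heapq.heappush(self._heap, (self.predict(prompt), self._arrival, prompt))
        self._arrival += 1

    def pop(self):
        _, _, prompt = heapq.heappop(self._heap)
        return prompt

# Example with a crude stand-in predictor (prompt length as a proxy):
q = ShortestPredictedFirstQueue(predict=len)
for p in ["summarize this 10k-token report ...", "2+2?", "translate one line"]:
    q.push(p)
print(q.pop())  # -> "2+2?"
```

The FCFS tie-breaker keeps ordering stable among requests with equal scores, which avoids starving same-score requests that arrived earlier.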
Business Value
Reduces serving latency and improves throughput for LLM deployments, lowering cost per request and enabling more responsive, scalable AI-powered applications and services.