
PARS: Low-Latency LLM Serving via Pairwise Learning-to-Rank

Abstract

Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
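
To make the core mechanism concrete, here is a minimal sketch of pairwise learning-to-rank with margin ranking loss, the technique the abstract names. The MLP predictor over prompt embeddings, its dimensions, and all hyperparameters are illustrative assumptions, not the paper's actual design; only the loss formulation follows the abstract.

```python
# Minimal sketch (not PARS's implementation): train a predictor so that,
# for a pair of prompts, the one with the longer observed response
# receives the higher score, using a margin ranking loss.
import torch
import torch.nn as nn

class LengthRanker(nn.Module):
    """Scores a prompt embedding; higher score = longer predicted response."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)

ranker = LengthRanker()
loss_fn = nn.MarginRankingLoss(margin=1.0)
opt = torch.optim.Adam(ranker.parameters(), lr=1e-4)

# Toy batch: pairs of prompt embeddings with their observed response lengths.
emb_a, emb_b = torch.randn(32, 768), torch.randn(32, 768)
len_a, len_b = torch.randint(1, 2048, (32,)), torch.randint(1, 2048, (32,))

score_a, score_b = ranker(emb_a), ranker(emb_b)
# target = +1 where request a's response was longer than b's, else -1;
# the loss is max(0, -target * (score_a - score_b) + margin).
target = (len_a > len_b).float() * 2.0 - 1.0
loss = loss_fn(score_a, score_b, target)
loss.backward()
opt.step()
```

Because only the relative ordering of predicted lengths matters for SJF-style scheduling, a pairwise ranking objective sidesteps the harder problem of predicting exact token counts.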

Key Contributions

Introduces PARS, a prompt-aware LLM task scheduler that approximates shortest-job-first scheduling using pairwise learning-to-rank with margin ranking loss. Integrated into vLLM, PARS predicts response-length-based task ordering to reduce latency with minimal overhead, significantly improving performance across multiple LLMs and real-world inference datasets, especially for reasoning workloads.
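
On the scheduling side, a predicted-length ordering approximates SJF over the waiting queue: serving the shortest predicted job first keeps long-running tasks from blocking short ones behind them. The sketch below illustrates this with a standalone priority queue; `QueuedRequest`, `enqueue`, and `next_request` are hypothetical names, and PARS itself hooks this ordering into vLLM's scheduler rather than a separate heap.

```python
# Illustrative sketch only: approximate shortest-job-first by ordering the
# waiting queue on predicted response length, with arrival order as a
# deterministic tie-breaker. All names here are hypothetical.
import heapq
import itertools
from dataclasses import dataclass, field

_arrivals = itertools.count()  # monotonically increasing arrival index

@dataclass(order=True)
class QueuedRequest:
    predicted_len: int   # ranker's predicted response length
    arrival: int         # FCFS tie-breaker for equal predictions
    prompt: str = field(compare=False, default="")

def enqueue(queue: list, prompt: str, predicted_len: int) -> None:
    heapq.heappush(queue, QueuedRequest(predicted_len, next(_arrivals), prompt))

def next_request(queue: list) -> QueuedRequest:
    # Pop the shortest predicted job; ties fall back to arrival order.
    return heapq.heappop(queue)

queue: list = []
enqueue(queue, "summarize this paragraph", predicted_len=60)
enqueue(queue, "write a 2000-word essay", predicted_len=900)
print(next_request(queue).prompt)  # -> "summarize this paragraph"
```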

Business Value

Reduces end-to-end serving latency and increases throughput for LLM inference, lowering the cost of serving each request and enabling more responsive, scalable AI-powered applications and services.