Abstract
Efficient scheduling of LLM inference tasks is essential for achieving low
latency and high throughput, particularly with the growing use of
reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve
(FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks
delay shorter ones queued behind them. In this paper, we introduce PARS, a
prompt-aware LLM task scheduler that improves serving efficiency by
approximating shortest-job-first (SJF) scheduling through pairwise ranking with
margin ranking loss. PARS focuses on impactful scheduling decisions and is
seamlessly integrated into the state-of-the-art LLM serving system vLLM. It
effectively predicts response-length-based task ordering, reducing latency with
minimal overhead. Extensive experiments across multiple LLMs and real-world
inference datasets show that PARS significantly improves performance, including
for reasoning workloads. Furthermore, our cross-model evaluations demonstrate
that the design generalizes well, enabling effective scheduling even when
predictors are trained on different LLMs.
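The core idea is learning a relative ordering of prompts by expected response length, rather than regressing exact lengths. As a rough illustration (not the paper's implementation), the PyTorch sketch below trains a pairwise scorer with margin ranking loss; the embedding dimension, network shape, margin value, and toy data are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical scorer over fixed-size prompt embeddings (768-dim is an assumption).
model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))
loss_fn = nn.MarginRankingLoss(margin=0.5)  # margin value chosen for illustration
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy pair batch: embeddings for prompts a and b with known response lengths.
emb_a, emb_b = torch.randn(32, 768), torch.randn(32, 768)
len_a = torch.randint(1, 2048, (32,)).float()
len_b = torch.randint(1, 2048, (32,)).float()

s_a = model(emb_a).squeeze(-1)  # higher score = predicted-longer response
s_b = model(emb_b).squeeze(-1)
# target = +1 where a's true response is longer than b's, else -1
target = (len_a > len_b).float() * 2 - 1
loss = loss_fn(s_a, s_b, target)  # hinge on the score gap vs. the margin
opt.zero_grad()
loss.backward()
opt.step()
```

At serving time, sorting waiting requests by the learned score in ascending order approximates shortest-job-first without requiring calibrated length estimates.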
Key Contributions
Introduces PARS, a prompt-aware LLM task scheduler that approximates shortest-job-first scheduling using pairwise learning-to-rank with margin ranking loss. Integrated into vLLM, PARS predicts response-length-based task ordering to reduce latency with minimal overhead. It significantly improves performance across multiple LLMs and real-world inference datasets, especially for reasoning workloads.
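To make the scheduling side concrete, the following toy sketch orders a waiting queue by predicted response-length score, falling back to FCFS order on ties. This is a stand-in illustration under assumed interfaces, not vLLM's actual scheduler API or PARS's integration.

```python
import heapq

class ShortestPredictedFirstQueue:
    """Toy waiting queue that pops the request with the smallest predicted
    response-length score first (approximating SJF); ties fall back to FCFS."""

    def __init__(self, predict):
        self.predict = predict   # assumed callable: prompt -> length score
        self._heap = []
        self._arrival = 0        # monotonically increasing FCFS tie-breaker

    def push(self, prompt):
        heapq.heappush(self._heap, (self.predict(prompt), self._arrival, prompt))
        self._arrival += 1

    def pop(self):
        _, _, prompt = heapq.heappop(self._heap)
        return prompt

# Example with a crude stand-in predictor (prompt length as a proxy):
q = ShortestPredictedFirstQueue(predict=len)
for p in ["summarize this 10k-token report ...", "2+2?", "translate one line"]:
    q.push(p)
print(q.pop())  # -> "2+2?"
```

The FCFS tie-breaker keeps ordering stable among requests with equal scores, which avoids starving same-score requests that arrived earlier.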
Business Value
Reduces serving latency and improves throughput for LLM deployments, lowering cost per request and enabling more responsive, scalable AI-powered applications and services.