Abstract: Large language model (LLM) serving demands low latency and high throughput,
but high load variability makes it challenging to achieve high GPU utilization.
In this paper, we identify a synergistic but overlooked opportunity to co-serve
latency-critical online requests alongside latency-tolerant offline tasks such
as model benchmarking. While promising, existing serving systems fail to
co-serve them efficiently, as their coarse-grained resource management at the
request or iteration level cannot harvest millisecond-level GPU idle cycles
without introducing interference that violates online latency objectives.
ConServe is a new LLM co-serving system that achieves high throughput and
strong online latency guarantees by managing resources at finer granularities.
ConServe introduces three techniques: (1) a latency-aware token-level scheduler
that precisely sizes offline batches and tokens to fit within online latency
objectives; (2) sub-iteration, layer-wise preemption that allows offline tasks
to yield to online load spikes; and (3) incremental KV cache management that
enables preempting and resuming offline requests at near-zero cost. Evaluations
with Llama-3.1 and Qwen-2.5 models on real-world workloads show that ConServe
delivers an average of 2.2× higher throughput and reduces online serving
tail latency by 2.9× on average compared to state-of-the-art systems.
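
To make the token-level scheduling idea concrete, the sketch below is a minimal, hypothetical illustration (not ConServe's actual implementation; all names and cost constants such as `per_token_ms` are assumptions) of how an offline token budget could be derived from the remaining online latency headroom and used to chunk offline work into each serving iteration.

```python
# Hypothetical sketch of latency-aware token-level scheduling (not ConServe's code).
# Assumption: per-iteration latency grows roughly linearly with the number of
# batched tokens, so a token budget can be back-solved from the online SLO headroom.

from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: str
    pending_tokens: int   # tokens this request wants to process in this iteration
    is_online: bool       # latency-critical online request vs. latency-tolerant offline task


def offline_token_budget(online_tokens: int,
                         slo_ms: float,
                         base_ms: float,
                         per_token_ms: float) -> int:
    """Return how many offline tokens fit in this iteration without breaking the online SLO.

    slo_ms       -- per-iteration latency objective for online requests (assumed)
    base_ms      -- fixed per-iteration overhead, assumed constant
    per_token_ms -- marginal cost per batched token, assumed from offline profiling
    """
    headroom_ms = slo_ms - base_ms - online_tokens * per_token_ms
    return max(0, int(headroom_ms / per_token_ms))


def build_batch(online: List[Request], offline: List[Request],
                slo_ms: float, base_ms: float, per_token_ms: float) -> List[Request]:
    """Admit all online work first, then fill leftover headroom with offline tokens."""
    batch = list(online)
    budget = offline_token_budget(sum(r.pending_tokens for r in online),
                                  slo_ms, base_ms, per_token_ms)
    for req in offline:
        if budget <= 0:
            break
        take = min(req.pending_tokens, budget)   # chunk the offline request to fit the budget
        batch.append(Request(req.req_id, take, is_online=False))
        budget -= take
    return batch


if __name__ == "__main__":
    online = [Request("o1", 128, True)]
    offline = [Request("b1", 4096, False), Request("b2", 2048, False)]
    # Illustrative numbers only: 50 ms SLO, 5 ms fixed cost, 0.02 ms per token.
    for r in build_batch(online, offline, slo_ms=50.0, base_ms=5.0, per_token_ms=0.02):
        print(r.req_id, r.pending_tokens)
```

This only captures the batch-sizing aspect; per the abstract, ConServe additionally relies on sub-iteration, layer-wise preemption to yield to online load spikes mid-iteration and on incremental KV cache management to make preempting and resuming offline requests nearly free.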