Abstract
Speculative decoding speeds up LLM inference by using a small draft model to
propose multiple tokens that a target model verifies in parallel. Extending
this idea to batches is essential for production serving, but it introduces the
ragged tensor problem: sequences in the same batch accept different numbers of
draft tokens, breaking right-alignment and corrupting position IDs, attention
masks, and KV-cache state. We show that several existing batch implementations
violate output equivalence, the fundamental requirement that speculative
decoding must produce identical token sequences to standard autoregressive
generation. These violations occur precisely due to improper handling of the
ragged tensor problem. In response, we (1) characterize the synchronization
requirements that guarantee correctness, (2) present EQSPEC, a correctness-first
batch speculative decoding scheme that exposes realignment as consuming 40% of
overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences
and dynamically forms same-length groups, to reduce the realignment overhead
while preserving per-sequence speculative speedups. On the SpecBench dataset,
across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our
approach achieves up to 3$\times$ throughput improvement at batch size 8
compared to batch size 1, with efficient scaling through batch size 8, while
maintaining 95% output equivalence. Our method requires no custom kernels and
integrates cleanly with existing inference stacks. Our code is available at
https://github.com/eBay/spec_dec.
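As a rough illustration of the ragged tensor problem described above, the sketch below re-pads a batch whose rows accepted different numbers of draft tokens so that every row is right-aligned again, then rebuilds position IDs and the attention mask for the next decoding step. The function name, padding strategy, and tensor layout are illustrative assumptions, not the paper's EQSPEC implementation.

```python
import torch

def realign_ragged_batch(sequences: list[list[int]], pad_id: int = 0):
    """Right-align a ragged batch after speculative verification (illustrative sketch).

    `sequences` holds each row's tokens after a variable number of draft
    tokens was accepted, so rows differ in length (the ragged tensor
    problem). We left-pad to a common width so the newest token of every
    row sits in the last column, then rebuild the position ids and the
    attention mask that the next step expects.
    """
    width = max(len(s) for s in sequences)
    batch = len(sequences)

    input_ids = torch.full((batch, width), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((batch, width), dtype=torch.long)
    position_ids = torch.zeros((batch, width), dtype=torch.long)

    for i, seq in enumerate(sequences):
        n = len(seq)
        # place real tokens flush against the right edge of the row
        input_ids[i, width - n:] = torch.tensor(seq, dtype=torch.long)
        attention_mask[i, width - n:] = 1
        # positions restart at 0 for each row's non-pad tokens
        position_ids[i, width - n:] = torch.arange(n)

    return input_ids, attention_mask, position_ids
```

This re-padding is exactly the synchronization work the paper measures as realignment overhead; skipping it (or computing position IDs from the padded width) is what breaks output equivalence in naive batch implementations.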
Authors (6)
Ranran Haoran Zhang
Soumik Dey
Ashirbad Mishra
Hansi Wu
Binbin Li
Rui Zhang
Submitted
October 26, 2025
Key Contributions
This paper addresses the challenges of batching speculative decoding for LLM inference, specifically the ragged tensor problem that corrupts positional information and attention masks. It introduces EQSPEC, a correctness-first approach, and EXSPEC, which reduces realignment overhead by maintaining a sliding pool of sequences and dynamically forming same-length groups, while preserving output equivalence with standard autoregressive generation.
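A minimal sketch of the same-length grouping idea follows, assuming a pool that maps sequence IDs to their current lengths; the function name, grouping rule, and batch budget are assumptions for illustration, not the paper's EXSPEC API.

```python
from collections import defaultdict

def form_same_length_groups(pool: dict[int, int], max_batch: int = 8) -> list[list[int]]:
    """Group sequences of equal current length for batched verification (sketch).

    `pool` maps a sequence id to its current token count. Rows of the same
    length can be verified together with no realignment, since their
    tensors are already rectangular.
    """
    by_length: dict[int, list[int]] = defaultdict(list)
    for seq_id, length in pool.items():
        by_length[length].append(seq_id)

    groups: list[list[int]] = []
    for seq_ids in by_length.values():
        # split oversized buckets so each group fits the batch budget
        for start in range(0, len(seq_ids), max_batch):
            groups.append(seq_ids[start:start + max_batch])
    return groups

# Example: sequences 0 and 2 share a length and can be verified together.
print(form_same_length_groups({0: 17, 1: 20, 2: 17, 3: 21}))
```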
Business Value
Significantly speeds up LLM inference for production serving by enabling efficient batching, leading to lower operational costs and improved user experience.