
Batch Speculative Decoding Done Right

📄 Abstract

Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, breaking right-alignment and corrupting position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence, the fundamental requirement that speculative decoding produce token sequences identical to standard autoregressive generation. These violations stem precisely from improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present EQSPEC, a correctness-first batch speculative decoding approach that exposes realignment as consuming 40% of overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups to reduce realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3× throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at https://github.com/eBay/spec_dec.
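
The ragged tensor problem can be made concrete with a small sketch. The snippet below is an illustrative example only, not code from the paper's EQSPEC implementation: it shows how a batch whose sequences accepted different numbers of draft tokens must be re-padded and have its position IDs and attention mask recomputed so that right-alignment stays valid. The function name `realign_right` and the toy token IDs are assumptions made for this example.

```python
# Minimal sketch of the ragged tensor problem after batch verification.
# Hypothetical illustration, not the paper's EQSPEC implementation: the name
# `realign_right` and the toy sequences below are made up for this example.
import torch

def realign_right(seqs, pad_id):
    """Right-align a batch whose sequences accepted different numbers of
    draft tokens, recomputing the attention mask and position IDs."""
    max_len = max(len(s) for s in seqs)
    input_ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(seqs), max_len), dtype=torch.long)
    position_ids = torch.zeros((len(seqs), max_len), dtype=torch.long)
    for i, s in enumerate(seqs):
        n = len(s)
        input_ids[i, max_len - n:] = torch.tensor(s, dtype=torch.long)
        attention_mask[i, max_len - n:] = 1
        # positions count only real (non-pad) tokens, starting from 0
        position_ids[i, max_len - n:] = torch.arange(n)
    return input_ids, attention_mask, position_ids

# Example: both sequences started at length 4; after one verification step,
# sequence 0 accepted 3 draft tokens and sequence 1 accepted only 1, so their
# lengths diverge and the batch becomes "ragged".
seqs = [[11, 12, 13, 14, 15, 16, 17], [21, 22, 23, 24, 25]]
ids, mask, pos = realign_right(seqs, pad_id=0)
print(ids)
print(pos)
```

Repeating this re-padding after every verification step is exactly the realignment work that the paper measures as a large share of batch speculative decoding overhead.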
Authors (6)
Ranran Haoran Zhang
Soumik Dey
Ashirbad Mishra
Hansi Wu
Binbin Li
Rui Zhang
Submitted: October 26, 2025
arXiv Category: cs.CL

Key Contributions

This paper addresses the challenges of batching speculative decoding for LLM inference, specifically the ragged tensor problem that corrupts position IDs, attention masks, and KV-cache state. It introduces EQSPEC, a correctness-first approach that exposes realignment as a major source of overhead, and EXSPEC, which reduces that overhead by maintaining a sliding pool of sequences and dynamically forming same-length groups, while preserving output equivalence with standard autoregressive generation. A grouping sketch follows below.
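
To make the same-length grouping idea concrete, here is a minimal sketch of how a sliding pool of in-flight sequences could be bucketed into equal-length groups so that each group can be decoded together without introducing ragged tensors. This is an assumption-laden illustration, not the paper's EXSPEC code; `form_same_length_groups`, the pool structure, and `max_batch_size` are hypothetical names.

```python
# Hypothetical sketch of same-length grouping from a sliding pool: batch
# together sequences of equal current length so no realignment is needed
# inside a group. Names are illustrative, not the paper's API.
from collections import defaultdict

def form_same_length_groups(pool, max_batch_size=8):
    """Group sequence IDs by current length; each group can run one
    speculative step as a regular rectangular batch."""
    buckets = defaultdict(list)
    for seq_id, length in pool.items():
        buckets[length].append(seq_id)
    groups = []
    for length, seq_ids in sorted(buckets.items(), reverse=True):
        for i in range(0, len(seq_ids), max_batch_size):
            groups.append(seq_ids[i:i + max_batch_size])
    return groups

# Example: a sliding pool whose sequences have diverged in length after a few
# speculative steps.
pool = {"a": 130, "b": 127, "c": 130, "d": 125, "e": 127}
print(form_same_length_groups(pool))  # [['a', 'c'], ['b', 'e'], ['d']]
```

Because every group is rectangular by construction, per-sequence speculative speedups are preserved while the realignment cost is paid only when sequences re-enter the pool, which is the trade-off the paper attributes to EXSPEC.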

Business Value

Significantly speeds up LLM inference for production serving by enabling efficient batching, leading to lower operational costs and improved user experience.