
3-Model Speculative Decoding

Abstract

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a trade-off between draft model size and token acceptance: smaller draft models generate tokens more quickly but diverge further from the target model, resulting in lower acceptance rates and reduced speedups. We introduce Pyramid Speculative Decoding (PyramidSD), an extension of SD that inserts an intermediate qualifier model between the draft and target to bridge the distributional gap in output predictions, allowing smaller models to be used for drafting. This hierarchical decoding strategy improves alignment across models, enabling higher acceptance rates and allowing the use of significantly smaller draft models without sacrificing overall performance. PyramidSD builds on fuzzy acceptance criteria to support relaxed divergence thresholds at each stage, improving throughput. In experiments, PyramidSD achieves up to 1.91x the generation speed of standard SD, reaching 124 tokens per second on a consumer GPU (RTX 4090). In small-memory settings with a 1B-parameter draft model and an 8B target model, PyramidSD sacrifices minimal target-model quality for improved throughput. Overall, PyramidSD offers a practical approach to enhancing speculative decoding efficiency and can be readily applied to existing inference pipelines.
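
The draft → qualifier → target flow described above can be illustrated with a short sketch. Everything below is a toy reconstruction from the abstract, not the paper's code: the standard rejection-sampling acceptance rule, the `gamma` draft length, the context-independent toy "models", and all function names are assumptions, and residual resampling on rejection is omitted for brevity.

```python
import torch

def sd_accept(p_small, p_big, token):
    """Standard speculative-decoding acceptance: keep `token` with
    probability min(1, p_big[token] / p_small[token])."""
    ratio = (p_big[token] / p_small[token].clamp_min(1e-9)).clamp(max=1.0)
    return bool(torch.rand(()) < ratio)

def pyramid_step(draft, qualifier, target, prefix, gamma=4):
    """One pyramid iteration: the draft proposes up to `gamma` tokens,
    the intermediate qualifier pre-filters each one, and the target
    verifies the survivors."""
    accepted, ctx = [], list(prefix)
    for _ in range(gamma):
        p_d = draft(ctx)                      # cheap draft distribution
        tok = int(torch.multinomial(p_d, 1))  # draft proposal
        p_q = qualifier(ctx)                  # intermediate qualifier
        if not sd_accept(p_d, p_q, tok):      # qualifier rejects: stop
            break                             # before paying for target
        p_t = target(ctx)                     # large target model
        if not sd_accept(p_q, p_t, tok):      # final target verification
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy demo: context-independent "models" over a 5-token vocabulary.
torch.manual_seed(0)
mk = lambda: torch.softmax(torch.randn(5), dim=0)
p_draft, p_qual, p_targ = mk(), mk(), mk()
print(pyramid_step(lambda ctx: p_draft, lambda ctx: p_qual,
                   lambda ctx: p_targ, prefix=[0]))
```

In a real pipeline the target would verify all qualifier-approved tokens in a single batched forward pass; the sequential loop here is only for clarity.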
Authors (6)
Sanghyun Byun
Mohanad Odema
Jung Ick Guack
Baisub Lee
Jacob Song
Woo Seong Chung
Submitted: October 14, 2025
arXiv Category: cs.CL

Key Contributions

Introduces Pyramid Speculative Decoding (PyramidSD), an extension of Speculative Decoding that uses an intermediate qualifier model to bridge the distributional gap between a small draft model and a large target model. This hierarchical strategy improves token acceptance rates, allows the use of significantly smaller draft models without sacrificing performance, and enhances overall inference throughput.
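
The "fuzzy acceptance criteria" mentioned above are not detailed on this page. One plausible way to relax the standard acceptance rule, purely an assumption for illustration (including the `threshold` parameter and the per-stage values), is to accept a proposed token whenever the verifier assigns it at least a fixed fraction of the proposer's probability:

```python
import torch

def fuzzy_accept(p_proposer, p_verifier, token, threshold=0.5):
    """Hypothetical relaxed acceptance: accept `token` when the verifier's
    probability is at least `threshold` times the proposer's, instead of
    sampling against the exact likelihood ratio."""
    ratio = p_verifier[token] / p_proposer[token].clamp_min(1e-9)
    return bool(ratio >= threshold)

# A looser threshold between draft and qualifier and a stricter one
# between qualifier and target would give per-stage divergence
# thresholds (the values here are made up).
DRAFT_TO_QUALIFIER = 0.3
QUALIFIER_TO_TARGET = 0.7
```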

Business Value

Significantly reduces the computational cost and latency of LLM inference, making large models more practical and cost-effective for real-time applications and high-volume deployments.