📄 Abstract
Speculative Decoding (SD) accelerates inference in large language models by
using a smaller draft model to propose tokens, which are then verified by a
larger target model. However, the throughput gains of SD are fundamentally
limited by a trade-off between draft model size and token acceptance: smaller
draft models generate tokens more quickly but exhibit greater divergence from
the target model, resulting in lower acceptance rates and reduced speedups. We
introduce Pyramid Speculative Decoding (PyramidSD), an extension of SD that
inserts an intermediate qualifier model between the draft and target to bridge
the distributional gap in output predictions, allowing smaller models to be used
for drafting. This hierarchical decoding strategy improves alignment across
models, enabling higher acceptance rates and allowing the use of significantly
smaller draft models without sacrificing overall performance. PyramidSD builds
on fuzzy acceptance criteria to support relaxed divergence thresholds at each
stage, improving throughput. In experiments, PyramidSD achieves up to a 1.91x
generation speedup over standard SD, reaching 124 tokens per second on a consumer
GPU (RTX 4090). In small-memory settings with a 1B-parameter draft model and an
8B target model, PyramidSD minimally trades target model quality for improved
throughput. Overall, PyramidSD offers a practical approach to enhancing
speculative decoding efficiency and can be readily applied to existing
inference pipelines.
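
To make the hierarchical draft-qualify-verify flow concrete, here is a minimal Python sketch of one PyramidSD-style decoding round. The model interface, the fuzzy_accept rule, and the thresholds tau_q and tau_t are illustrative assumptions; the abstract does not specify the paper's exact fuzzy acceptance criteria, so treat this as a sketch of the control flow rather than the authors' implementation.

```python
# Minimal sketch of a PyramidSD-style decoding round.
# Assumptions: the model interface, the fuzzy-acceptance rule, and the
# thresholds below are hypothetical; the paper's exact criteria may differ.

import math
import random
from typing import Callable, Dict, List

# A "model" here is any callable mapping a token prefix to a next-token
# probability distribution. In practice these would be a small draft model,
# a mid-size qualifier, and the large target LLM.
Model = Callable[[List[int]], Dict[int, float]]

def fuzzy_accept(p_small: Dict[int, float], p_big: Dict[int, float],
                 token: int, tau: float) -> bool:
    """Relaxed acceptance: accept if the larger model's probability for the
    proposed token is within a log-divergence threshold of the smaller
    model's. (Illustrative rule, not the paper's exact criterion.)"""
    ps = p_small.get(token, 1e-9)
    pb = p_big.get(token, 1e-9)
    return math.log(ps) - math.log(pb) <= tau

def sample(dist: Dict[int, float]) -> int:
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs, k=1)[0]

def pyramid_sd_step(prefix: List[int], draft: Model, qualifier: Model,
                    target: Model, k: int = 4,
                    tau_q: float = 1.0, tau_t: float = 1.0) -> List[int]:
    """One draft -> qualify -> verify round: the draft proposes k tokens,
    the qualifier screens them under a relaxed threshold, and the target
    verifies the survivors. Returns the accepted continuation."""
    # 1) Draft proposes k tokens autoregressively.
    proposed, draft_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        d = draft(ctx)
        t = sample(d)
        proposed.append(t)
        draft_dists.append(d)
        ctx.append(t)

    # 2) Qualifier screens the proposals, stopping at the first rejection.
    qualified, ctx = [], list(prefix)
    for t, d in zip(proposed, draft_dists):
        if not fuzzy_accept(d, qualifier(ctx), t, tau_q):
            break
        qualified.append(t)
        ctx.append(t)

    # 3) Target verifies the qualified tokens.
    accepted, ctx = [], list(prefix)
    for t in qualified:
        if not fuzzy_accept(qualifier(ctx), target(ctx), t, tau_t):
            break
        accepted.append(t)
        ctx.append(t)

    # If nothing survived, fall back to sampling one token from the target.
    if not accepted:
        accepted.append(sample(target(prefix)))
    return accepted
```

In a real pipeline each verification stage would run as a single batched forward pass over the proposed tokens, which is where the throughput gain comes from; the pure-Python loops above only mirror the control flow of the three-tier pyramid.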
Authors (6)
Sanghyun Byun
Mohanad Odema
Jung Ick Guack
Baisub Lee
Jacob Song
Woo Seong Chung
Submitted
October 14, 2025
Key Contributions
Introduces Pyramid Speculative Decoding (PyramidSD), an extension of Speculative Decoding that uses an intermediate qualifier model to bridge the distributional gap between a small draft model and a large target model. This hierarchical strategy improves token acceptance rates, allows the use of significantly smaller draft models without sacrificing performance, and enhances overall inference throughput.
Business Value
Significantly reduces the computational cost and latency of LLM inference, making large models more practical and cost-effective for real-time applications and high-volume deployments.