📄 Abstract
Speculative decoding accelerates large language model inference, but its
reliance on a fixed speculation length is suboptimal in large-batch serving
environments with diverse requests. This paper explores a new direction for
dynamic adaptation by investigating a novel class of post-hoc, diagnostic
signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free
framework built on two primary components: (1) a predictive signal based on the
variance of the Kullback-Leibler divergence (KLD), which diagnoses the
generation's regional stability, and (2) an adaptive speculation length cap to
mitigate the straggler problem in per-sequence decoding. Experiments
demonstrate the potential of using KLD-based stability signals for dynamic
adaptation. An algorithm guided by these signals achieves end-to-end latency
competitive with leading baselines and exhibits superior robustness across
diverse workloads. This robustness is particularly valuable in challenging
low-acceptance-rate regimes, where the proposed signal maintains its diagnostic
utility. Collectively, these findings validate post-hoc signals as a valuable
component for building more robust and intelligent LLM inference systems, and
highlight a promising direction for future research on dynamic speculation
length adaptation.
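The core idea can be illustrated with a minimal sketch (assumptions only, not the paper's implementation): per-token KL divergences between the draft and target model distributions are collected during verification, their variance over a recent window acts as the regional-stability signal, and the speculation length cap shrinks as that variance grows. The function names, window size, and threshold below are hypothetical placeholders.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Pointwise KL divergence D(p || q) between two token distributions."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def speculation_length_cap(kld_history, window=8, k_min=1, k_max=8, threshold=0.5):
    """Map the variance of recent per-token KLDs to a speculation length cap.

    Low variance (stable region) -> allow longer speculation;
    high variance (unstable region) -> cap speculation aggressively.
    All constants here are illustrative, not values from the paper.
    """
    if len(kld_history) < 2:
        return k_max
    recent = kld_history[-window:]
    variance = float(np.var(recent))
    # Simple monotone mapping: shrink the cap as variance grows.
    scale = max(0.0, 1.0 - variance / threshold)
    return max(k_min, int(round(k_min + scale * (k_max - k_min))))

# Example usage with toy distributions and KLD values observed while
# verifying earlier speculative drafts.
draft = [0.70, 0.20, 0.10]
target = [0.65, 0.25, 0.10]
klds = [kl_divergence(draft, target), 0.03, 0.05, 0.04, 0.31, 0.42]
print(speculation_length_cap(klds))  # cap shrinks as KLD variance rises
```

In this sketch the cap is recomputed per sequence, which is one plausible way an adaptive cap could limit stragglers in large-batch, per-sequence decoding; the paper's actual policy may differ.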
Authors (5)
Mingyu Yang
Jae-Young Choi
Kihyo Moon
Minsung Jang
Eunjoo Jeon
Submitted
September 1, 2025
Key Contributions
This paper introduces Dynamic Speculative Decoding Engine (DSDE), a training-free framework that dynamically adapts speculative decoding for large-batch LLM serving. DSDE uses a KLD variance-based predictive signal to diagnose generation stability and an adaptive speculation length cap to mitigate the straggler problem, achieving competitive latency and superior robustness across diverse workloads.
Business Value
Reducing inference latency and improving robustness in LLM serving is critical for applications requiring real-time responses, such as chatbots, virtual assistants, and content generation services. DSDE can lead to significant cost savings and improved user satisfaction by enabling more efficient and reliable LLM deployments.