
DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving

📄 Abstract

Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
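
To make the proposed signal concrete, the sketch below illustrates how a sliding-window variance of the per-token KL divergence between draft and target distributions could be mapped to a speculation length. This is a minimal illustration under assumed parameters (window size, stability threshold, length bounds) and hypothetical helper names (kl_divergence, kld_variance_signal, choose_speculation_length); it is not DSDE's published algorithm.

```python
# Hypothetical sketch of a KLD-variance stability signal driving speculation
# length. Window size, threshold, and the variance-to-length mapping are
# illustrative assumptions, not the paper's exact rule.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """KL(p || q) between two token distributions over the same vocabulary."""
    p = p + eps
    q = q + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def kld_variance_signal(kld_history: list[float], window: int = 8) -> float:
    """Variance of recent per-token KLD values; low variance ~ stable region."""
    if len(kld_history) < 2:
        return float("inf")  # not enough evidence yet: treat as unstable
    return float(np.var(kld_history[-window:]))

def choose_speculation_length(signal: float,
                              min_len: int = 1,
                              max_len: int = 8,
                              stable_threshold: float = 0.05) -> int:
    """Map the stability signal to a speculation length (illustrative rule)."""
    if signal == float("inf"):
        return min_len
    if signal < stable_threshold:
        return max_len  # generation looks locally stable: speculate more
    # Shrink the speculation window as instability grows.
    scale = stable_threshold / signal
    return max(min_len, int(round(max_len * scale)))

# Example: per-step usage inside a (simulated) decode loop.
rng = np.random.default_rng(0)
vocab = 32
kld_history: list[float] = []
for step in range(20):
    draft_probs = rng.dirichlet(np.ones(vocab))    # stand-in for draft model output
    target_probs = rng.dirichlet(np.ones(vocab))   # stand-in for target model output
    kld_history.append(kl_divergence(target_probs, draft_probs))
    k = choose_speculation_length(kld_variance_signal(kld_history))
    # 'k' would bound the number of draft tokens proposed at this step.
```

The intuition being illustrated: a low-variance KLD window suggests the draft model is tracking the target closely, so longer speculation is more likely to be accepted, while rising variance indicates an unstable region where shorter drafts waste less verification work.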
Authors (5)
Mingyu Yang
Jae-Young Choi
Kihyo Moon
Minsung Jang
Eunjoo Jeon
Submitted
September 1, 2025
arXiv Category
cs.DC
arXiv PDF

Key Contributions

This paper introduces Dynamic Speculative Decoding Engine (DSDE), a training-free framework that dynamically adapts speculative decoding for large-batch LLM serving. DSDE uses a KLD variance-based predictive signal to diagnose generation stability and an adaptive speculation length cap to mitigate the straggler problem, achieving competitive latency and superior robustness across diverse workloads.
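
The straggler-mitigation component can be illustrated with a similarly hedged sketch: in batched serving, each verification step waits for the longest speculated sequence, so a batch-level cap on per-sequence speculation length keeps a few long drafts from stalling the rest of the batch. The percentile-based rule and names below (adaptive_length_cap, apply_cap, percentile, hard_max) are illustrative assumptions, not the paper's exact mechanism.

```python
# Minimal sketch of an adaptive per-batch speculation length cap for straggler
# mitigation. The percentile-based cap and parameter names are illustrative
# assumptions, not DSDE's exact rule.
import numpy as np

def adaptive_length_cap(proposed_lengths: list[int],
                        percentile: float = 75.0,
                        hard_max: int = 16) -> int:
    """Batch-level cap so a few long drafts cannot stall batched verification."""
    if not proposed_lengths:
        return hard_max
    cap = int(np.percentile(proposed_lengths, percentile))
    return max(1, min(cap, hard_max))

def apply_cap(proposed_lengths: list[int], cap: int) -> list[int]:
    """Clamp each sequence's per-step speculation length to the batch cap."""
    return [min(k, cap) for k in proposed_lengths]

# Example: lengths suggested by a per-sequence signal (e.g., KLD variance).
proposed = [2, 3, 3, 4, 12, 3, 2, 5]   # one straggler wants 12 draft tokens
cap = adaptive_length_cap(proposed)     # lands near the 75th percentile
print(cap, apply_cap(proposed, cap))
```

In this toy example the cap clamps the single sequence requesting 12 draft tokens while leaving the rest of the batch untouched, which is the design trade-off the adaptive cap targets: slightly shorter speculation for outliers in exchange for shorter batch-wide verification stalls.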

Business Value

Reducing inference latency and improving robustness in LLM serving is critical for applications requiring real-time responses, such as chatbots, virtual assistants, and content generation services. DSDE can lead to significant cost savings and improved user satisfaction by enabling more efficient and reliable LLM deployments.