Abstract
Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency, but its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, which restricts the pool of available draft models and often necessitates training a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose TokenTiming, an algorithm for universal speculative decoding. It re-encodes the draft token sequence to obtain a new target token sequence, then uses DTW to build a mapping that transfers the probability distributions for speculative sampling. As a result, our method accommodates mismatched vocabularies and works with any off-the-shelf model without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating a 1.57x speedup. This work enables a universal approach to draft model selection, making SD a more versatile and practical tool for LLM acceleration.
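
To make the alignment step concrete, the following is a minimal Python sketch of the DTW idea described above: warping a draft-tokenizer sequence onto a target-tokenizer sequence so that drafted probabilities can be carried across vocabularies. The token strings, the 0/1 surface-match cost, and the probability-transfer comment are illustrative assumptions, not the paper's exact formulation.

# A minimal sketch (not the authors' code) of DTW-based token alignment.
import numpy as np

def dtw_align(draft_tokens, target_tokens):
    """Classic DTW over two token sequences; returns the warping path."""
    n, m = len(draft_tokens), len(target_tokens)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Illustrative local cost: 0 if surface strings match, 1 otherwise.
            d = 0.0 if draft_tokens[i - 1] == target_tokens[j - 1] else 1.0
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match / substitute
                                 cost[i - 1, j],      # skip a draft token
                                 cost[i, j - 1])      # skip a target token
    # Backtrack from (n, m) to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return list(reversed(path))

# Example: the two tokenizers segment the same text differently.
draft = ["token", "ization", "is", "fun"]
target = ["tok", "enization", "is", "fun"]
print(dtw_align(draft, target))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
# Each (draft_idx, target_idx) pair indicates which draft-token distribution
# to transfer onto the aligned target token for speculative sampling.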
Authors (4)
Sibo Xiao
Jinyuan Fu
Zhongle Xie
Lidan Shou
Submitted
October 17, 2025
Key Contributions
This paper introduces TokenTiming, a novel algorithm for universal speculative decoding that overcomes the limitation of mismatched vocabularies between draft and target models. Inspired by Dynamic Time Warping, TokenTiming allows off-the-shelf models to be used without retraining by re-encoding draft tokens and using DTW to align probability distributions, significantly accelerating LLM inference.
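
Once the DTW path pairs each target token with a draft token, the transferred distribution can feed the standard speculative-sampling acceptance test. The sketch below is hedged: q_draft and p_target are placeholder probability lookups assumed for illustration, not the paper's interface; the accept-with-probability min(1, p/q) rule itself is the usual speculative-sampling test.

import random

def accept_token(token, q_draft, p_target):
    """Standard speculative-sampling acceptance: keep a drafted token
    with probability min(1, p_target(token) / q_draft(token))."""
    q = q_draft.get(token, 1e-9)  # draft probability, mapped via the DTW path
    p = p_target.get(token, 0.0)  # target probability at the aligned position
    return random.random() < min(1.0, p / q)

# Example: a token the draft model over-weights relative to the target
# is accepted only about half the time here.
print(accept_token("fun", q_draft={"fun": 0.8}, p_target={"fun": 0.4}))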
Business Value
Significantly accelerating LLM inference reduces computational costs and latency, making large generative models more practical and cost-effective for real-time applications and wider deployment.