
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

📄 Abstract

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency, but its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, which restricts the pool of available draft models and often necessitates training a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose TokenTiming, an algorithm for universal speculative decoding. It re-encodes the draft token sequence into a target token sequence and then uses DTW to build a mapping that transfers the probability distributions for speculative sampling. As a result, our method accommodates mismatched vocabularies and works with any off-the-shelf model without retraining or modification. Comprehensive experiments on various tasks demonstrate a 1.57x speedup. This work enables a universal approach to draft model selection, making SD a more versatile and practical tool for LLM acceleration.
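To make the alignment idea concrete, here is a minimal sketch of DTW over two tokenizations of the same text. The token strings, the character-overlap cost, and the function names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Minimal sketch (assumptions, not the paper's implementation): align a
# draft-model tokenization with a target-model tokenization of the same text
# using classic DTW over a simple character-overlap cost.

def dtw_align(draft_tokens, target_tokens, cost):
    """Return a warping path [(i, j), ...] matching draft tokens to target tokens."""
    n, m = len(draft_tokens), len(target_tokens)
    INF = float("inf")
    # dp[i][j] = minimal accumulated cost of aligning the first i draft tokens
    # with the first j target tokens
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(draft_tokens[i - 1], target_tokens[j - 1])
            dp[i][j] = c + min(dp[i - 1][j],      # one draft token spans several target tokens
                               dp[i][j - 1],      # one target token spans several draft tokens
                               dp[i - 1][j - 1])  # one-to-one match
    # Backtrack from (n, m) to recover the alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((dp[i - 1][j - 1], (i - 1, j - 1)),
                        (dp[i - 1][j], (i - 1, j)),
                        (dp[i][j - 1], (i, j - 1)))
    return list(reversed(path))

def char_mismatch_cost(a, b):
    """Toy cost: 0 if one token string contains the other, else 1 (assumption)."""
    return 0.0 if (a in b or b in a) else 1.0

# Example: the same text split differently by two tokenizers (illustrative subwords)
draft  = ["spec", "ulative", " decoding"]
target = ["specul", "ative", " dec", "oding"]
print(dtw_align(draft, target, char_mismatch_cost))
# -> [(0, 0), (1, 1), (2, 2), (2, 3)]: the last draft token covers two target tokens
```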
Authors (4): Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
Submitted: October 17, 2025
arXiv Category: cs.CL
arXiv PDF

Key Contributions

This paper introduces TokenTiming, an algorithm for universal speculative decoding that removes the requirement that draft and target models share a vocabulary. Inspired by Dynamic Time Warping, TokenTiming re-encodes the draft token sequence with the target model's tokenizer and uses DTW to map probability distributions between the two token sequences, so any off-the-shelf draft model can be used without retraining while significantly accelerating LLM inference.
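For intuition on how such a mapping could feed the speculative accept/reject step, the sketch below applies the standard speculative-sampling test (accept with probability min(1, p/q)) to each target token, looking up the draft probability through the DTW path from the previous sketch. Routing the draft probability through the alignment with a simple product rule, and all names and numbers, are hypothetical assumptions rather than the paper's specification.

```python
# Hedged sketch: using an alignment path to transfer draft probabilities onto
# target tokens for the standard speculative-sampling acceptance test.
import random

def transferred_draft_prob(path, draft_probs, target_idx):
    """Combine draft probabilities of all draft tokens aligned to one target token
    (simple product; returns 1.0 if nothing is aligned) -- an illustrative assumption."""
    prob = 1.0
    for i, j in path:
        if j == target_idx:
            prob *= draft_probs[i]
    return prob

def accept_target_token(path, draft_probs, target_probs, target_idx):
    """Standard speculative-sampling rule: accept with probability min(1, p/q)."""
    q = transferred_draft_prob(path, draft_probs, target_idx)
    p = target_probs[target_idx]
    return random.random() < min(1.0, p / max(q, 1e-12))

# Toy usage with the alignment from the previous sketch (all values illustrative)
path = [(0, 0), (1, 1), (2, 2), (2, 3)]
draft_probs  = [0.9, 0.8, 0.7]          # draft model's probability per draft token
target_probs = [0.85, 0.75, 0.6, 0.5]   # target model's probability per target token
print([accept_target_token(path, draft_probs, target_probs, j) for j in range(len(target_probs))])
```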

Business Value

Significantly accelerating LLM inference reduces computational costs and latency, making large generative models more practical and cost-effective for real-time applications and wider deployment.