
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

📄 Abstract

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency, but its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, which restricts the pool of available draft models and often necessitates training a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose TokenTiming, an algorithm for universal speculative decoding. It re-encodes the draft token sequence into a target token sequence and then uses DTW to build a mapping that transfers the probability distributions for speculative sampling. As a result, our method accommodates mismatched vocabularies and works with any off-the-shelf model without retraining or modification. Comprehensive experiments on various tasks demonstrate a 1.57x speedup. This work enables a universal approach to draft model selection, making SD a more versatile and practical tool for LLM acceleration.
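To make the alignment idea concrete, here is a minimal sketch of DTW over two tokenizations of the same text. The token strings, the character-overlap cost, and the function names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Minimal sketch (assumptions, not the paper's implementation): align a
# draft-model tokenization with a target-model tokenization of the same text
# using classic DTW over a simple character-overlap cost.

def dtw_align(draft_tokens, target_tokens, cost):
    """Return a warping path [(i, j), ...] matching draft tokens to target tokens."""
    n, m = len(draft_tokens), len(target_tokens)
    INF = float("inf")
    # dp[i][j] = minimal accumulated cost of aligning the first i draft tokens
    # with the first j target tokens
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(draft_tokens[i - 1], target_tokens[j - 1])
            dp[i][j] = c + min(dp[i - 1][j],      # one draft token spans several target tokens
                               dp[i][j - 1],      # one target token spans several draft tokens
                               dp[i - 1][j - 1])  # one-to-one match
    # Backtrack from (n, m) to recover the alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((dp[i - 1][j - 1], (i - 1, j - 1)),
                        (dp[i - 1][j], (i - 1, j)),
                        (dp[i][j - 1], (i, j - 1)))
    return list(reversed(path))

def char_mismatch_cost(a, b):
    """Toy cost: 0 if one token string contains the other, else 1 (assumption)."""
    return 0.0 if (a in b or b in a) else 1.0

# Example: the same text split differently by two tokenizers (illustrative subwords)
draft  = ["spec", "ulative", " decoding"]
target = ["specul", "ative", " dec", "oding"]
print(dtw_align(draft, target, char_mismatch_cost))
# -> [(0, 0), (1, 1), (2, 2), (2, 3)]: the last draft token covers two target tokens
```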
Authors (4): Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
Submitted: October 17, 2025
arXiv Category: cs.CL
arXiv PDF

Key Contributions

This paper introduces TokenTiming, an algorithm for universal speculative decoding that removes the requirement that draft and target models share a vocabulary. Inspired by Dynamic Time Warping, TokenTiming re-encodes the draft token sequence with the target model's tokenizer and uses DTW to map probability distributions between the two token sequences, so any off-the-shelf draft model can be used without retraining while significantly accelerating LLM inference.
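For intuition on how such a mapping could feed the speculative accept/reject step, the sketch below applies the standard speculative-sampling test (accept with probability min(1, p/q)) to each target token, looking up the draft probability through the DTW path from the previous sketch. Routing the draft probability through the alignment with a simple product rule, and all names and numbers, are hypothetical assumptions rather than the paper's specification.

```python
# Hedged sketch: using an alignment path to transfer draft probabilities onto
# target tokens for the standard speculative-sampling acceptance test.
import random

def transferred_draft_prob(path, draft_probs, target_idx):
    """Combine draft probabilities of all draft tokens aligned to one target token
    (simple product; returns 1.0 if nothing is aligned) -- an illustrative assumption."""
    prob = 1.0
    for i, j in path:
        if j == target_idx:
            prob *= draft_probs[i]
    return prob

def accept_target_token(path, draft_probs, target_probs, target_idx):
    """Standard speculative-sampling rule: accept with probability min(1, p/q)."""
    q = transferred_draft_prob(path, draft_probs, target_idx)
    p = target_probs[target_idx]
    return random.random() < min(1.0, p / max(q, 1e-12))

# Toy usage with the alignment from the previous sketch (all values illustrative)
path = [(0, 0), (1, 1), (2, 2), (2, 3)]
draft_probs  = [0.9, 0.8, 0.7]          # draft model's probability per draft token
target_probs = [0.85, 0.75, 0.6, 0.5]   # target model's probability per target token
print([accept_target_token(path, draft_probs, target_probs, j) for j in range(len(target_probs))])
```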

Business Value

Significantly accelerating LLM inference reduces computational costs and latency, making large generative models more practical and cost-effective for real-time applications and wider deployment.