Abstract
Speculative decoding (SD) has emerged as an effective technique to accelerate
large language model (LLM) inference without compromising output quality.
However, the achievable speedup largely depends on the effectiveness of the
drafting model. While model-based methods like EAGLE-2 are accurate but costly,
retrieval-enhanced methods like SAM-Decoding rely on heuristic switching
strategies that often trigger unnecessary retrievals. To address this, we
propose ReSpec (\textbf{Re}trieval-enhanced \textbf{Spe}culative Decoding), a
novel framework that transforms heuristic drafter switching into adaptive
decision-making. ReSpec features three core innovations: 1) An
\textbf{entropy-guided adaptive trigger} quantifies contextual predictability
to initiate retrieval only when uncertainty is low, avoiding costly low-quality
speculations. 2) A \textbf{feedback-driven candidate selection} leverages
historical feedback to organize multiple high-quality candidates for parallel
verification, maximizing retrieval utility. 3) A source-aware \textbf{relaxed
verification strategy} applies strict checks to model-generated drafts while
using a relaxed verification for retrieved drafts, achieving a better balance
between accuracy and efficiency. Extensive experiments on Spec-Bench
demonstrate that ReSpec achieves state-of-the-art acceleration, outperforming
EAGLE-2 and SAM-Decoding by over $33\%$ and $25\%$, respectively, while
maintaining output quality.
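To make the first mechanism more concrete, the sketch below shows one way an entropy-guided trigger could gate drafter switching: compute the entropy of the target model's next-token distribution and invoke retrieval-based drafting only when the entropy falls below a threshold (i.e., the context is highly predictable). This is a minimal illustration under assumed interfaces; the threshold `tau`, the function names, and the `retriever`/`model_drafter` objects are hypothetical and not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def entropy_guided_trigger(logits: torch.Tensor, tau: float = 1.5) -> bool:
    """Decide whether to invoke retrieval-based drafting.

    logits: next-token logits from the target LLM, shape (vocab_size,).
    tau:    hypothetical entropy threshold; the paper's actual value and
            normalization may differ.
    Returns True when the entropy of the predicted distribution is below tau,
    i.e. the context is predictable enough that a retrieved draft is likely
    to be accepted.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    return entropy.item() < tau

def propose_draft(logits, retriever, model_drafter, prefix_ids):
    """Illustrative drafter-switching step (interfaces are assumptions)."""
    if entropy_guided_trigger(logits):
        # Low uncertainty: cheap retrieval from the context / datastore.
        return retriever.lookup(prefix_ids)
    # High uncertainty: fall back to a model-based drafter (e.g. EAGLE-style).
    return model_drafter.draft(prefix_ids)
```

The intent of such a gate is to avoid the "unnecessary retrievals" the abstract attributes to heuristic switching: retrieval is attempted only where a retrieved continuation has a realistic chance of surviving verification.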
Authors (4)
Min Fang
Zhihui Fu
Qibin Zhao
Jun Wang
Submitted
November 3, 2025
Key Contributions
ReSpec proposes a novel framework for retrieval-enhanced speculative decoding that transforms heuristic switching into adaptive decision-making. It introduces an entropy-guided adaptive trigger to initiate retrieval only when contextual predictability is high (i.e., uncertainty is low), and a feedback-driven candidate selection mechanism to organize high-quality candidates for parallel verification, aiming to improve LLM inference speed without sacrificing output quality.
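The third contribution, source-aware relaxed verification, is only mentioned briefly above. The sketch below illustrates what such a strategy could look like, assuming a greedy exact-match check for model-generated drafts and a top-k relaxation for retrieved drafts; the specific relaxation criterion and all names are illustrative assumptions rather than the paper's exact rule.

```python
import torch

def verify_tokens(target_logits: torch.Tensor,
                  draft_tokens: torch.Tensor,
                  source: str,
                  top_k: int = 8) -> int:
    """Source-aware verification sketch (illustrative, not ReSpec's exact rule).

    target_logits: (T, vocab) logits of the target LLM at each draft position.
    draft_tokens:  (T,) proposed token ids from the drafter.
    source:        "model" for model-generated drafts, "retrieval" for retrieved ones.
    Returns the length of the accepted prefix.
    """
    accepted = 0
    for t, tok in enumerate(draft_tokens.tolist()):
        if source == "model":
            # Strict check: the draft token must match the target's greedy choice.
            ok = tok == int(target_logits[t].argmax())
        else:
            # Relaxed check (assumed here to be top-k membership): accept retrieved
            # tokens that rank within the target model's top_k candidates.
            ok = tok in target_logits[t].topk(top_k).indices.tolist()
        if not ok:
            break
        accepted += 1
    return accepted
```

The design intuition is that retrieved drafts trade a small amount of per-token strictness for longer accepted spans, while model-generated drafts keep strict checks so output quality is preserved.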
Business Value
Accelerating LLM inference can significantly reduce operational costs for businesses deploying LLM-powered applications, leading to faster response times and improved user experience.