Abstract
Quantization has emerged as an effective and lightweight solution to reduce
the memory footprint of the KV cache in Large Language Models. Nevertheless,
minimizing the accuracy degradation caused by ultra-low-bit KV cache
quantization remains a significant challenge. While scalar quantization is
constrained by the 1-bit bound, vector quantization exploits intra-vector
correlations and enables sub-bit regimes, making it more suitable for
ultra-low-bit quantization. To further mitigate quantization-induced
degradation, we reveal that the degradation in attention quality is highly
uneven across tokens. To investigate this unevenness, we introduce the Anchor Score to
measure each token's sensitivity to quantization. Our analysis and experiments
show that preserving a small subset (1%) of tokens with the highest Anchor
Score significantly mitigates accuracy loss under aggressive quantization.
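To make the idea concrete, below is a minimal sketch of anchor-token selection. It assumes the anchor score can be approximated by the total attention mass each cached token receives; the paper's actual Anchor Score definition may differ, and the function name and `keep_ratio` parameter are illustrative only.

```python
import torch

def select_anchor_tokens(attn_weights: torch.Tensor, keep_ratio: float = 0.01):
    """Illustrative anchor-token selection (not the paper's exact method).

    attn_weights: [heads, q_len, kv_len] attention probabilities.
    The anchor score is approximated here as the attention mass each KV
    token attracts, summed over heads and query positions.
    """
    scores = attn_weights.sum(dim=(0, 1))          # [kv_len] proxy anchor scores
    k = max(1, int(keep_ratio * scores.numel()))   # keep the top 1% by default
    anchor_idx = torch.topk(scores, k).indices     # tokens preserved at full precision
    return anchor_idx
```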
We propose AnTKV, a dual-stage framework that leverages anchor token-aware
vector quantization to compress the KV cache. It combines offline token-aware
centroids learning and online anchor token selection to balance compression and
accuracy. To enable efficient deployment, we design an online anchor token
selection kernel compatible with FlashAttention. It allows LLaMA3-8B to scale
to 840K tokens on a single 80GB A100, while delivering up to $3.5\times$ higher
decoding throughput over the FP16 baseline. Experiments demonstrate that AnTKV
matches or surpasses prior methods at 4-bit, and significantly reduces
perplexity under ultra-low-bit quantization, achieving 6.32 at 1-bit on
Mistral-7B, compared to 7.25 for CQ and 15.36 for KVQuant.
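As a rough illustration of the dual-stage pipeline described above, the sketch below pairs an offline centroid-learning step (plain k-means over KV sub-vectors) with an online step that encodes non-anchor tokens as centroid codes while keeping anchor tokens in FP16. This is a simplified stand-in under those assumptions, not AnTKV's token-aware centroid learning or its FlashAttention-compatible kernel; all names here are hypothetical.

```python
import torch

def learn_centroids(kv_vectors: torch.Tensor, n_centroids: int = 256, iters: int = 10):
    """Offline stage (sketch): k-means over calibration KV sub-vectors [N, d]."""
    centroids = kv_vectors[torch.randperm(kv_vectors.shape[0])[:n_centroids]].clone()
    for _ in range(iters):
        assign = torch.cdist(kv_vectors, centroids).argmin(dim=1)
        for c in range(n_centroids):
            mask = assign == c
            if mask.any():
                centroids[c] = kv_vectors[mask].mean(dim=0)
    return centroids

def quantize_kv(kv_vectors: torch.Tensor, centroids: torch.Tensor, anchor_idx: torch.Tensor):
    """Online stage (sketch): code non-anchor sub-vectors, keep anchors in FP16."""
    codes = torch.cdist(kv_vectors, centroids).argmin(dim=1)  # 8-bit codes for 256 centroids
    fp16_anchors = kv_vectors[anchor_idx].half()              # preserved anchor tokens
    return codes, fp16_anchors
```

With 256 centroids over sub-vectors of several dimensions, the per-element bit cost drops well below 1 bit, which is the sub-bit regime the abstract attributes to vector quantization.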
Authors (9)
Zeyu Li
Chuanfu Xiao
Yang Wang
Xiang Liu
Zhenheng Tang
Baotong Lu
+3 more
Key Contributions
AnTKV proposes an anchor token-aware vector quantization method for KV cache in LLMs to mitigate accuracy loss during ultra-low-bit quantization. By identifying and preserving sensitive 'anchor' tokens, it significantly reduces accuracy degradation while achieving substantial memory savings.
Business Value
Enables the deployment of larger and more capable LLMs on hardware with limited memory, reducing operational costs and expanding accessibility.