Abstract
Quantization has emerged as an effective and lightweight solution to reduce
the memory footprint of the KV cache in Large Language Models. Nevertheless,
minimizing the accuracy degradation caused by ultra-low-bit KV cache
quantization remains a significant challenge. While scalar quantization is
constrained by the 1-bit bound, vector quantization exploits intra-vector
correlations and enables sub-bit regimes, making it more suitable for
ultra-low-bit quantization. To further mitigate quantization-induced
degradation, we reveal that the degradation in attention quality is highly
uneven across tokens. To investigate this unevenness, we introduce the Anchor Score to
measure each token's sensitivity to quantization. Our analysis and experiments
show that preserving a small subset (1%) of tokens with the highest Anchor
Score significantly mitigates accuracy loss under aggressive quantization.
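To make the idea concrete, below is a minimal sketch of anchor-token selection. It assumes the anchor score can be approximated by the total attention mass each cached token receives; the paper's actual Anchor Score definition may differ, and the function name and `keep_ratio` parameter are illustrative only.

```python
import torch

def select_anchor_tokens(attn_weights: torch.Tensor, keep_ratio: float = 0.01):
    """Illustrative anchor-token selection (not the paper's exact method).

    attn_weights: [heads, q_len, kv_len] attention probabilities.
    The anchor score is approximated here as the attention mass each KV
    token attracts, summed over heads and query positions.
    """
    scores = attn_weights.sum(dim=(0, 1))          # [kv_len] proxy anchor scores
    k = max(1, int(keep_ratio * scores.numel()))   # keep the top 1% by default
    anchor_idx = torch.topk(scores, k).indices     # tokens preserved at full precision
    return anchor_idx
```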
We propose AnTKV, a dual-stage framework that leverages anchor token-aware
vector quantization to compress the KV cache. It combines offline token-aware
centroids learning and online anchor token selection to balance compression and
accuracy. To enable efficient deployment, we design an online anchor token
selection kernel compatible with FlashAttention. It allows LLaMA3-8B to scale
to 840K tokens on a single 80GB A100, while delivering up to $3.5\times$ higher
decoding throughput over the FP16 baseline. Experiments demonstrate that AnTKV
matches or surpasses prior methods at 4-bit, and significantly reduces
perplexity under ultra-low-bit quantization, achieving 6.32 at 1-bit on
Mistral-7B, compared to 7.25 for CQ and 15.36 for KVQuant.
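As a rough illustration of the dual-stage pipeline described above, the sketch below pairs an offline centroid-learning step (plain k-means over KV sub-vectors) with an online step that encodes non-anchor tokens as centroid codes while keeping anchor tokens in FP16. This is a simplified stand-in under those assumptions, not AnTKV's token-aware centroid learning or its FlashAttention-compatible kernel; all names here are hypothetical.

```python
import torch

def learn_centroids(kv_vectors: torch.Tensor, n_centroids: int = 256, iters: int = 10):
    """Offline stage (sketch): k-means over calibration KV sub-vectors [N, d]."""
    centroids = kv_vectors[torch.randperm(kv_vectors.shape[0])[:n_centroids]].clone()
    for _ in range(iters):
        assign = torch.cdist(kv_vectors, centroids).argmin(dim=1)
        for c in range(n_centroids):
            mask = assign == c
            if mask.any():
                centroids[c] = kv_vectors[mask].mean(dim=0)
    return centroids

def quantize_kv(kv_vectors: torch.Tensor, centroids: torch.Tensor, anchor_idx: torch.Tensor):
    """Online stage (sketch): code non-anchor sub-vectors, keep anchors in FP16."""
    codes = torch.cdist(kv_vectors, centroids).argmin(dim=1)  # 8-bit codes for 256 centroids
    fp16_anchors = kv_vectors[anchor_idx].half()              # preserved anchor tokens
    return codes, fp16_anchors
```

With 256 centroids over sub-vectors of several dimensions, the per-element bit cost drops well below 1 bit, which is the sub-bit regime the abstract attributes to vector quantization.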
Authors (9)
Zeyu Li
Chuanfu Xiao
Yang Wang
Xiang Liu
Zhenheng Tang
Baotong Lu
+3 more
Key Contributions
AnTKV proposes an anchor token-aware vector quantization method for KV cache in LLMs to mitigate accuracy loss during ultra-low-bit quantization. By identifying and preserving sensitive 'anchor' tokens, it significantly reduces accuracy degradation while achieving substantial memory savings.
Business Value
Enables the deployment of larger and more capable LLMs on hardware with limited memory, reducing operational costs and expanding accessibility.