Abstract
Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to the substantial Key-Value cache size and high memory bandwidth requirements. Previous research has demonstrated that the KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, because of the widely adopted Rotary Position Embedding (RoPE) mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, applying RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space (SALS) framework. SALS projects the KV cache into a compact latent space via low-rank projection and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models, LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance while maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and a 5.7-fold speed-up in the attention operator compared to FlashAttention2 on 4K sequences. For end-to-end throughput, SALS achieves 1.4-fold and 4.5-fold improvements over GPT-fast on 4K and 32K sequences, respectively.
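To make the abstract's pipeline concrete, below is a minimal single-head sketch of the idea it describes: keys are cached in a compact latent space, token selection runs as a RoPE-free query-key interaction in that space, and only the selected keys are reconstructed and rotated before sparse attention. The projection matrices W_down/W_up, the apply_rope helper, the top-k budget, and the uncompressed value cache are illustrative assumptions, not the authors' released implementation.

```python
import torch

torch.manual_seed(0)

d_head, d_latent, seq_len, top_k = 64, 16, 512, 64

# Hypothetical low-rank projection pair for the key cache. In practice these
# would come from a learned or SVD-based factorization; random matrices are
# used here only to keep the sketch self-contained.
W_down = torch.randn(d_head, d_latent) / d_head ** 0.5   # d_head -> d_latent
W_up = torch.randn(d_latent, d_head) / d_latent ** 0.5   # d_latent -> d_head


def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Standard rotary position embedding over paired halves of the last dim."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions[:, None].float() * freqs[None, :]            # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


# Pre-RoPE query/key/value for a single head at one decode step.
q = torch.randn(1, d_head)            # current query
k = torch.randn(seq_len, d_head)      # historical keys (before RoPE)
v = torch.randn(seq_len, d_head)      # historical values (kept full-rank here)

# The cache stores keys in the compact latent space instead of full keys.
k_latent = k @ W_down                 # (seq_len, d_latent)

# 1) Sparse token selection: RoPE-free query-key interaction in latent space.
q_latent = q @ W_down                                     # (1, d_latent)
latent_scores = (q_latent @ k_latent.T).squeeze(0)        # (seq_len,)
positions = latent_scores.topk(top_k).indices.sort().values

# 2) Reconstruct only the selected keys and apply RoPE at their positions,
#    avoiding reconstruction of the full KV cache.
k_selected = apply_rope(k_latent[positions] @ W_up, positions)   # (top_k, d_head)
q_rope = apply_rope(q, torch.tensor([seq_len]))                  # newest position

# 3) Sparse attention over the reconstructed subset only.
attn = torch.softmax((q_rope @ k_selected.T) / d_head ** 0.5, dim=-1)
out = attn @ v[positions]                                        # (1, d_head)
print(out.shape)                                                 # torch.Size([1, 64])
```

The sketch illustrates why the latent-space selection step matters: the expensive reconstruct-then-rotate path runs only over the top-k selected tokens rather than the full sequence, which is the source of the reported attention-operator speed-up.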
Authors (6)
Junlin Mu
Hantao Huang
Jihang Zhang
Minghui Yu
Tao Wang
Yidong Li
Submitted
October 28, 2025
Key Contributions
This paper proposes SALS (Sparse Attention in Latent Space) to address KV cache compression challenges in LLMs. It builds on two insights: applying RoPE increases the variance (and hence the rank) of key vectors, and keys largely retain their representation once projected into the latent space. This enables effective compression without the severe accuracy loss or new speed bottlenecks of naive low-rank methods.
Business Value
Significantly reduces the computational cost and memory footprint of deploying large language models, making them more accessible and affordable for a wider range of applications and hardware.