📄 Abstract
Recent large vision-language models (LVLMs) demonstrate remarkable
capabilities in processing extended multi-modal sequences, yet the resulting
key-value (KV) cache expansion creates a critical memory bottleneck that
fundamentally limits deployment scalability. While existing KV cache
compression methods focus on retaining high-importance KV pairs to minimize
storage, they often overlook the modality-specific semantic redundancy patterns
that emerge distinctively in multi-modal KV caches. In this work, we first
analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying
levels of redundancy across attention heads. We show that relying solely on
importance can only cover a subset of the full KV cache information
distribution, leading to potential loss of semantic coverage. To address this,
we propose MixKV, a novel method that mixes importance with diversity
for optimized KV cache compression in LVLMs. MixKV adapts to head-wise
semantic redundancy, selectively balancing diversity and importance when
compressing KV pairs. Extensive experiments demonstrate that MixKV
consistently enhances existing methods across multiple LVLMs. Under extreme
compression (budget=64), MixKV improves baseline methods by an average
of 5.1% across five multi-modal understanding benchmarks and achieves
remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on
GUI grounding tasks, all while maintaining comparable inference efficiency.
Furthermore, MixKV extends seamlessly to LLMs with comparable
performance gains. Our code is available at
https://github.com/xuyang-liu16/MixKV.
Authors (4)
Xuyang Liu
Xiyan Gui
Yuchao Zhang
Linfeng Zhang
Submitted
October 23, 2025
Key Contributions
This paper introduces MixKV, a method for KV cache compression in LVLMs that jointly optimizes for importance and diversity. It addresses the limitations of existing importance-only methods by analyzing and exploiting modality-specific semantic redundancy across attention heads, yielding broader semantic coverage of the cache at a fixed compression budget.
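To make the importance-plus-diversity idea concrete, below is a minimal, hypothetical sketch of per-head KV selection. It is not the authors' released implementation: it assumes importance is approximated by attention mass from a recent query window (SnapKV-style), diversity by greedy farthest-point selection over key vectors, and the per-head mix by the head's key redundancy (mean pairwise cosine similarity). All function names and the redundancy-to-mixing mapping are illustrative assumptions.

```python
# Hypothetical sketch of importance + diversity KV selection for one attention head.
import torch
import torch.nn.functional as F


def head_redundancy(keys: torch.Tensor) -> float:
    """Mean pairwise cosine similarity of keys [seq_len, head_dim] for one head."""
    k = F.normalize(keys, dim=-1)
    sim = k @ k.T                               # [seq_len, seq_len]
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()


def mixkv_select(keys, values, attn_weights, budget):
    """
    Select `budget` KV pairs for a single head.

    keys, values : [seq_len, head_dim]
    attn_weights : [window, seq_len] attention from the last `window` queries
    Returns the compressed (keys, values).
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Importance: attention mass each position receives from the recent query window.
    importance = attn_weights.mean(dim=0)       # [seq_len]

    # Redundancy-aware split of the budget: more redundant heads lean on diversity.
    # (This specific mapping is an assumption for illustration, not the paper's formula.)
    alpha = max(0.0, min(1.0, head_redundancy(keys)))
    n_div = min(int(budget * alpha), budget - 1)
    n_imp = budget - n_div

    # 1) Keep the most important positions.
    selected = set(importance.topk(n_imp).indices.tolist())

    # 2) Greedily add positions least similar to everything already kept.
    k_norm = F.normalize(keys, dim=-1)
    for _ in range(n_div):
        sel_idx = torch.tensor(sorted(selected))
        max_sim = (k_norm @ k_norm[sel_idx].T).max(dim=1).values
        max_sim[sel_idx] = float("inf")         # never re-select kept positions
        selected.add(max_sim.argmin().item())

    idx = torch.tensor(sorted(selected))
    return keys[idx], values[idx]


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, head_dim, window, budget = 256, 64, 16, 64
    keys, values = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
    attn = torch.softmax(torch.randn(window, seq_len), dim=-1)
    k_c, v_c = mixkv_select(keys, values, attn, budget)
    print(k_c.shape, v_c.shape)                 # torch.Size([64, 64]) twice
```

In this sketch, a head whose keys are highly redundant devotes more of its budget to diverse positions, while a low-redundancy head falls back to importance-only selection; the actual head-wise balancing rule used by MixKV is described in the paper and repository linked above.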
Business Value
Significantly reduces the memory footprint of large vision-language models, making them more feasible to deploy on edge devices or in resource-constrained environments, thereby expanding their practical applications.