
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

📄 Abstract

Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
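
The abstract does not give the exact selection rule, but a minimal sketch of the general idea, mixing a per-token importance score with a greedy diversity term and retaining a fixed per-head budget of KV pairs, might look like the following. The function name `mixkv_select`, the accumulated-attention importance input, the cosine-similarity diversity term, and the mixing weight `alpha` are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of importance + diversity KV selection (not the authors' code).
import torch
import torch.nn.functional as F

def mixkv_select(keys: torch.Tensor,
                 importance: torch.Tensor,
                 budget: int,
                 alpha: float = 0.5) -> torch.Tensor:
    """Greedily pick `budget` KV positions per head.

    keys:       [num_heads, seq_len, head_dim]
    importance: [num_heads, seq_len], e.g. accumulated attention each token receives
    returns:    [num_heads, budget] indices of retained positions
    """
    num_heads, seq_len, _ = keys.shape
    keys_n = F.normalize(keys, dim=-1)  # unit-norm keys for cosine similarity
    selected = torch.zeros(num_heads, budget, dtype=torch.long)

    for h in range(num_heads):
        imp = importance[h]
        picked = torch.zeros(seq_len, dtype=torch.bool)

        # Seed with the single most important token.
        idx = int(imp.argmax())
        kept = [idx]
        picked[idx] = True
        # Max similarity of every token to the already-kept set.
        max_sim = keys_n[h] @ keys_n[h, idx]  # [seq_len]

        for _ in range(budget - 1):
            diversity = 1.0 - max_sim                       # dissimilar tokens score higher
            score = alpha * imp + (1.0 - alpha) * diversity # mix importance with diversity
            score = score.masked_fill(picked, float("-inf"))
            idx = int(score.argmax())
            kept.append(idx)
            picked[idx] = True
            max_sim = torch.maximum(max_sim, keys_n[h] @ keys_n[h, idx])

        selected[h] = torch.tensor(kept)
    return selected
```

In this sketch, `alpha` controls the importance/diversity trade-off for every head equally; the head-wise adaptation described in the abstract would presumably set this weight per attention head according to how redundant that head's cached keys are.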
Authors (4)
Xuyang Liu
Xiyan Gui
Yuchao Zhang
Linfeng Zhang
Submitted
October 23, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

This paper introduces MixKV, a novel method for KV cache compression in LVLMs that jointly optimizes for importance and diversity. It addresses the limitations of existing importance-only methods by analyzing and leveraging modality-specific semantic redundancy across attention heads, yielding better semantic coverage and higher accuracy at the same compression budget. A sketch of one way to estimate that head-wise redundancy follows below.
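
One way to picture the head-wise adaptation (an assumption on our part, not the paper's exact metric) is to estimate each head's key redundancy as the mean pairwise cosine similarity of its cached keys, then weight diversity more heavily for the more redundant heads:

```python
# Hypothetical head-wise redundancy estimate; the metric and the alpha mapping
# are illustrative assumptions, not the paper's formulation.
import torch
import torch.nn.functional as F

def head_redundancy(keys: torch.Tensor) -> torch.Tensor:
    """keys: [num_heads, seq_len, head_dim] -> redundancy score in [0, 1] per head."""
    k = F.normalize(keys, dim=-1)
    sim = k @ k.transpose(-1, -2)                    # [num_heads, seq_len, seq_len]
    n = sim.shape[-1]
    # Mean of the off-diagonal similarities (diagonal entries are all 1).
    mean_off_diag = (sim.sum(dim=(-1, -2)) - n) / (n * (n - 1))
    return mean_off_diag.clamp(0.0, 1.0)

# More redundant heads get a smaller importance weight, i.e. more emphasis on diversity.
alpha_per_head = 1.0 - head_redundancy(torch.randn(8, 1024, 64))
```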

Business Value

Significantly reduces the KV cache memory footprint of large vision-language models, making them more feasible to deploy on edge devices or in resource-constrained environments, thereby expanding their practical applications.