📄 Abstract
Recent large vision-language models (LVLMs) demonstrate remarkable
capabilities in processing extended multi-modal sequences, yet the resulting
key-value (KV) cache expansion creates a critical memory bottleneck that
fundamentally limits deployment scalability. While existing KV cache
compression methods focus on retaining high-importance KV pairs to minimize
storage, they often overlook the modality-specific semantic redundancy patterns
that emerge distinctively in multi-modal KV caches. In this work, we first
analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying
levels of redundancy across attention heads. We show that relying solely on
importance can only cover a subset of the full KV cache information
distribution, leading to potential loss of semantic coverage. To address this,
we propose MixKV, a novel method that mixes importance with diversity
for optimized KV cache compression in LVLMs. MixKV adapts to head-wise
semantic redundancy, selectively balancing diversity and importance when
compressing KV pairs. Extensive experiments demonstrate that MixKV
consistently enhances existing methods across multiple LVLMs. Under extreme
compression (budget=64), MixKV improves baseline methods by an average
of 5.1% across five multi-modal understanding benchmarks and achieves
remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on
GUI grounding tasks, all while maintaining comparable inference efficiency.
Furthermore, MixKV extends seamlessly to LLMs with comparable
performance gains. Our code is available at
https://github.com/xuyang-liu16/MixKV.
Authors (4)
Xuyang Liu
Xiyan Gui
Yuchao Zhang
Linfeng Zhang
Submitted
October 23, 2025
Key Contributions
This paper introduces MixKV, a method for KV cache compression in LVLMs that jointly optimizes for importance and diversity. It addresses the limitations of existing importance-only methods by analyzing and exploiting modality-specific semantic redundancy across attention heads, yielding broader semantic coverage of the cache at a fixed compression budget.
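To make the importance-plus-diversity idea concrete, below is a minimal, hypothetical sketch of per-head KV selection. It is not the authors' released implementation: it assumes importance is approximated by attention mass from a recent query window (SnapKV-style), diversity by greedy farthest-point selection over key vectors, and the per-head mix by the head's key redundancy (mean pairwise cosine similarity). All function names and the redundancy-to-mixing mapping are illustrative assumptions.

```python
# Hypothetical sketch of importance + diversity KV selection for one attention head.
import torch
import torch.nn.functional as F


def head_redundancy(keys: torch.Tensor) -> float:
    """Mean pairwise cosine similarity of keys [seq_len, head_dim] for one head."""
    k = F.normalize(keys, dim=-1)
    sim = k @ k.T                               # [seq_len, seq_len]
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()


def mixkv_select(keys, values, attn_weights, budget):
    """
    Select `budget` KV pairs for a single head.

    keys, values : [seq_len, head_dim]
    attn_weights : [window, seq_len] attention from the last `window` queries
    Returns the compressed (keys, values).
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Importance: attention mass each position receives from the recent query window.
    importance = attn_weights.mean(dim=0)       # [seq_len]

    # Redundancy-aware split of the budget: more redundant heads lean on diversity.
    # (This specific mapping is an assumption for illustration, not the paper's formula.)
    alpha = max(0.0, min(1.0, head_redundancy(keys)))
    n_div = min(int(budget * alpha), budget - 1)
    n_imp = budget - n_div

    # 1) Keep the most important positions.
    selected = set(importance.topk(n_imp).indices.tolist())

    # 2) Greedily add positions least similar to everything already kept.
    k_norm = F.normalize(keys, dim=-1)
    for _ in range(n_div):
        sel_idx = torch.tensor(sorted(selected))
        max_sim = (k_norm @ k_norm[sel_idx].T).max(dim=1).values
        max_sim[sel_idx] = float("inf")         # never re-select kept positions
        selected.add(max_sim.argmin().item())

    idx = torch.tensor(sorted(selected))
    return keys[idx], values[idx]


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, head_dim, window, budget = 256, 64, 16, 64
    keys, values = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
    attn = torch.softmax(torch.randn(window, seq_len), dim=-1)
    k_c, v_c = mixkv_select(keys, values, attn, budget)
    print(k_c.shape, v_c.shape)                 # torch.Size([64, 64]) twice
```

In this sketch, a head whose keys are highly redundant devotes more of its budget to diverse positions, while a low-redundancy head falls back to importance-only selection; the actual head-wise balancing rule used by MixKV is described in the paper and repository linked above.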
Business Value
Significantly reduces the memory footprint of large vision-language models, making them more feasible to deploy on edge devices or in resource-constrained environments, thereby expanding their practical applications.