Abstract
Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, which incurs considerable computational overhead even though many of these tokens are redundant. Existing visual token pruning methods primarily select the most salient tokens based on attention scores, which can leave the selected tokens semantically incomplete. In this paper, we propose a novel visual token pruning strategy, Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), which jointly models both the saliency and the coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage measure for a given set of selected tokens, computed from pairwise token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage it would contribute if included. By integrating the saliency score into the token-coverage gain, we obtain the SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-NeXT models, and the results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
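To make the greedy saliency-coverage selection concrete, here is a minimal NumPy sketch of the procedure the abstract describes. It is not the authors' implementation: the cosine-similarity coverage kernel, the max-based set-coverage, the weighted combination controlled by `alpha`, and the function name `scope_select` are all illustrative assumptions; the paper and repository define the exact formulation.

```python
import numpy as np

def scope_select(features, saliency, k, alpha=0.5):
    """Greedy saliency-coverage token selection (illustrative sketch).

    features : (N, D) array of visual token features
    saliency : (N,) array of saliency scores (e.g., attention-based)
    k        : number of tokens to keep
    alpha    : assumed trade-off weight between saliency and coverage gain
    """
    # Assumed coverage kernel: cosine similarity between all token pairs.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T  # (N, N)

    n = features.shape[0]
    selected = []
    # best_cover[j]: how well token j is covered by the current selection,
    # i.e., max similarity to any selected token (defines set-coverage).
    best_cover = np.zeros(n)

    for _ in range(k):
        # Token-coverage gain: extra coverage each candidate would add.
        gain = np.maximum(sim - best_cover[None, :], 0.0).sum(axis=1)
        # Hypothetical SCOPE score: weighted mix of saliency and coverage gain.
        score = alpha * saliency + (1.0 - alpha) * gain / n
        if selected:
            score[selected] = -np.inf  # never reselect a token
        t = int(np.argmax(score))
        selected.append(t)
        best_cover = np.maximum(best_cover, sim[t])

    return sorted(selected)

# Example: keep 64 of 576 visual tokens (hypothetical shapes).
feats = np.random.randn(576, 1024).astype(np.float32)
sal = np.random.rand(576)
kept = scope_select(feats, sal, k=64)
```

The greedy loop mirrors classic submodular coverage maximization: each step picks the token whose marginal coverage gain (tempered by its saliency) is largest, so later picks favor regions of the image not yet represented by the selection.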
Authors (4)
Jinhong Deng
Wen Li
Joey Tianyi Zhou
Yang He
Submitted
October 28, 2025
Key Contributions
This paper proposes SCOPE, a novel visual token pruning strategy for efficient MLLMs that jointly considers token saliency and coverage. Unlike previous methods that focus solely on saliency, SCOPE preserves semantic completeness by selecting tokens that are both individually important and collectively provide broad coverage of the visual information.
Business Value
SCOPE enables the deployment of more powerful MLLMs on resource-constrained devices and reduces operational costs for large-scale AI services, making advanced multimodal AI more accessible and practical for businesses.