Abstract
Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, which incurs considerable computational overhead even though many of these tokens are redundant. Existing visual token pruning methods primarily select the most salient tokens based on attention scores, which can leave the selected tokens semantically incomplete. In this paper, we propose a novel visual token pruning strategy, Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), which jointly models both the saliency and the coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage measure for a given set of selected tokens, computed from pairwise token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage it would contribute if included. By integrating the saliency score into the token-coverage gain, we obtain the SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-NeXT models, and the results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
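To make the greedy saliency-coverage selection concrete, here is a minimal NumPy sketch of the procedure the abstract describes. It is not the authors' implementation: the cosine-similarity coverage kernel, the max-based set-coverage, the weighted combination controlled by `alpha`, and the function name `scope_select` are all illustrative assumptions; the paper and repository define the exact formulation.

```python
import numpy as np

def scope_select(features, saliency, k, alpha=0.5):
    """Greedy saliency-coverage token selection (illustrative sketch).

    features : (N, D) array of visual token features
    saliency : (N,) array of saliency scores (e.g., attention-based)
    k        : number of tokens to keep
    alpha    : assumed trade-off weight between saliency and coverage gain
    """
    # Assumed coverage kernel: cosine similarity between all token pairs.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T  # (N, N)

    n = features.shape[0]
    selected = []
    # best_cover[j]: how well token j is covered by the current selection,
    # i.e., max similarity to any selected token (defines set-coverage).
    best_cover = np.zeros(n)

    for _ in range(k):
        # Token-coverage gain: extra coverage each candidate would add.
        gain = np.maximum(sim - best_cover[None, :], 0.0).sum(axis=1)
        # Hypothetical SCOPE score: weighted mix of saliency and coverage gain.
        score = alpha * saliency + (1.0 - alpha) * gain / n
        if selected:
            score[selected] = -np.inf  # never reselect a token
        t = int(np.argmax(score))
        selected.append(t)
        best_cover = np.maximum(best_cover, sim[t])

    return sorted(selected)

# Example: keep 64 of 576 visual tokens (hypothetical shapes).
feats = np.random.randn(576, 1024).astype(np.float32)
sal = np.random.rand(576)
kept = scope_select(feats, sal, k=64)
```

The greedy loop mirrors classic submodular coverage maximization: each step picks the token whose marginal coverage gain (tempered by its saliency) is largest, so later picks favor regions of the image not yet represented by the selection.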
Authors (4)
Jinhong Deng
Wen Li
Joey Tianyi Zhou
Yang He
Submitted
October 28, 2025
Key Contributions
This paper proposes SCOPE, a novel visual token pruning strategy for efficient MLLMs that jointly considers token saliency and coverage. Unlike previous methods that focus solely on saliency, SCOPE preserves semantic completeness by selecting tokens that are both individually important and collectively provide broad coverage of the visual information.
Business Value
SCOPE enables the deployment of more powerful MLLMs on resource-constrained devices and reduces operational costs for large-scale AI services, making advanced multimodal AI more accessible and practical for businesses.