Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Abstract

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, InternVL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
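Finding (1) is easy to picture: the random-pruning baseline simply keeps a uniform random subset of the encoder's visual tokens before they reach the language model. The sketch below is a minimal, hypothetical illustration of that baseline, not code from the paper; the function name and the `keep_ratio` parameter (which corresponds to the pruning ratio highlighted in finding 4) are illustrative.

```python
# Hypothetical sketch of random visual token pruning; names and
# defaults are illustrative, not taken from UniPruneBench itself.
import torch

def random_prune(visual_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep a uniform random subset of visual tokens.

    visual_tokens: (batch, num_tokens, hidden_dim) output of the image encoder.
    keep_ratio: fraction of tokens passed on to the language model.
    """
    batch, num_tokens, hidden_dim = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sample token indices independently per batch element, then sort
    # so the surviving tokens stay in their original spatial order.
    idx = torch.rand(batch, num_tokens, device=visual_tokens.device).topk(num_keep, dim=1).indices
    idx = idx.sort(dim=1).values
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, hidden_dim))
```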

Key Contributions

Introduces UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. The benchmark standardizes protocols across six ability dimensions and ten datasets, evaluates ten representative compression algorithms on three LMM families (LLaVA-v1.5, InternVL3, and Qwen2.5-VL), and reports system-level metrics such as runtime and prefilling latency alongside task accuracy.
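As a rough illustration of the system-level side, prefilling latency can be measured as the wall-clock time of a single forward pass over the (possibly pruned) prompt embeddings, excluding autoregressive decoding. The sketch below is an assumption-laden example, not the benchmark's actual harness: it presumes a HuggingFace-style model that accepts `inputs_embeds`, and the warmup/run counts are arbitrary.

```python
# Hypothetical prefill-latency measurement in the spirit of the paper's
# system-level metrics; assumes an HF-style model taking inputs_embeds.
import time
import torch

@torch.no_grad()
def prefill_latency_ms(model, input_embeds: torch.Tensor,
                       n_warmup: int = 3, n_runs: int = 10) -> float:
    """Average wall-clock time (ms) of one forward pass over the prompt."""
    for _ in range(n_warmup):          # warm up kernels / caches
        model(inputs_embeds=input_embeds)
    if input_embeds.is_cuda:
        torch.cuda.synchronize()       # flush pending GPU work before timing
    start = time.perf_counter()
    for _ in range(n_runs):
        model(inputs_embeds=input_embeds)
    if input_embeds.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / n_runs
```

Comparing this number at different keep ratios is what separates methods that merely reduce token counts from ones that actually shorten prefill time.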

Business Value

Enables more efficient deployment of large multimodal models by identifying compression strategies that best balance accuracy against latency, reducing inference costs and improving user experience through faster response times.