📄 Abstract
Large multimodal models (LMMs) often suffer from severe inference
inefficiency due to the large number of visual tokens introduced by image
encoders. While recent token compression methods, such as pruning and merging,
have shown promise in reducing redundancy, their evaluation remains fragmented
and inconsistent. In this work, we present UniPruneBench, a unified and
extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench
provides standardized protocols across six ability dimensions and ten datasets,
covering ten representative compression algorithms and three families of LMMs
(LLaVA-v1.5, InternVL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates
system-level metrics such as runtime and prefilling latency to provide a
holistic view. Our experiments uncover several key findings: (1) random pruning
is a surprisingly strong baseline, (2) no single method consistently
outperforms others across scenarios, (3) pruning sensitivity varies
significantly across tasks, with OCR being most vulnerable, and (4) pruning
ratio is the dominant factor governing performance degradation. We believe
UniPruneBench will serve as a reliable foundation for future research on
efficient multimodal modeling.
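To make the evaluated setting concrete, below is a minimal sketch of the random-pruning baseline the abstract highlights as surprisingly strong: keeping a uniformly random subset of the image encoder's visual tokens before they enter the language model. The function name `random_prune`, the LLaVA-style token shapes, and the `keep_ratio` parameter are illustrative assumptions, not UniPruneBench's actual API.

```python
import torch

def random_prune(visual_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a uniformly random subset of visual tokens.

    visual_tokens: (batch, num_tokens, hidden_dim) embeddings from the
        image encoder, before they are fed to the language model.
    keep_ratio: fraction of tokens to retain, i.e. 1.0 minus the
        pruning ratio the abstract identifies as the dominant factor.
    """
    batch, num_tokens, _ = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sample an independent random score per token, keep the top-k,
    # then restore the original left-to-right order so positional
    # structure is preserved for the language model.
    scores = torch.rand(batch, num_tokens, device=visual_tokens.device)
    keep_idx = scores.topk(num_keep, dim=1).indices.sort(dim=1).values
    return torch.gather(
        visual_tokens, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)),
    )

# Example: prune 75% of 576 LLaVA-v1.5-style visual tokens.
tokens = torch.randn(2, 576, 4096)
pruned = random_prune(tokens, keep_ratio=0.25)
print(pruned.shape)  # torch.Size([2, 144, 4096])
```

Because this baseline uses no learned saliency signal at all, it is a natural floor against which the ten benchmarked compression algorithms can be compared at matched pruning ratios.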
Key Contributions
Introduces UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. It standardizes protocols across six ability dimensions and ten datasets, evaluates ten compression algorithms on three LMM families, and incorporates system-level metrics (runtime and prefilling latency) beyond task accuracy.
Business Value
Enables more efficient deployment of large multimodal models by identifying which compression strategies suit a given workload, reducing inference costs and improving user experience through faster response times.