
DRIP: Dynamic patch Reduction via Interpretable Pooling

Abstract

Recently, advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to large-scale and hence expensive pretraining, efficiency concerns have discouraged researchers from attempting to pretrain a vision-language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
Authors (2)
Yusen Peng
Sachin Kumar
Submitted
October 29, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Proposes DRIP (Dynamic patch Reduction via Interpretable Pooling), a method to dynamically merge tokens in deeper layers of visual encoders, significantly reducing GFLOPs during pretraining while maintaining comparable classification and zero-shot performance. This addresses the efficiency concerns hindering the pretraining of large vision-language models from scratch.
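The summary above does not spell out DRIP's exact pooling rule, but the general idea of dynamically merging tokens in deeper encoder layers can be illustrated with a minimal sketch. The snippet below is a hypothetical stand-in, not the paper's method: it scores tokens by a proxy importance measure (L2 norm, in place of a learned interpretable score), keeps the top-scoring fraction, and mean-pools each remaining token into its most similar kept token. The function name, scoring rule, and merge criterion are all illustrative assumptions.

```python
import numpy as np

def dynamic_token_merge(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Illustrative score-based token merging (hypothetical sketch, not DRIP's
    actual mechanism). Low-scoring tokens are merged into their most similar
    kept token by mean pooling, shrinking the sequence for deeper layers."""
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Proxy importance score: token L2 norm (stand-in for a learned score).
    scores = np.linalg.norm(tokens, axis=1)
    keep_idx = np.argsort(scores)[-k:]                  # tokens to keep
    merge_idx = np.setdiff1d(np.arange(n), keep_idx)    # tokens to merge away
    kept = tokens[keep_idx]
    kept_unit = kept / (np.linalg.norm(kept, axis=1, keepdims=True) + 1e-8)
    sums = kept.copy()
    counts = np.ones(k)
    for i in merge_idx:
        t = tokens[i]
        # Assign each merged token to the most cosine-similar kept token.
        sim = kept_unit @ (t / (np.linalg.norm(t) + 1e-8))
        j = int(np.argmax(sim))
        sums[j] += t
        counts[j] += 1
    # Mean-pool every group; output has k tokens instead of n.
    return sums / counts[:, None]
```

With `keep_ratio = 0.5`, the sequence length (and hence the quadratic attention cost in subsequent layers) is halved, which is the source of the GFLOP savings the paper reports; the actual method learns this reduction adaptively per input image.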

Business Value

Enables faster and more cost-effective training of powerful multimodal AI models, democratizing access to advanced AI capabilities and accelerating research in various domains, including scientific discovery.