Abstract
Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into a ternary format, rather than keeping them as full-precision floating-point values. TernaryCLIP incorporates quantization-aware training and distillation modules to prevent precision degradation and enable low-cost, high-efficiency computation. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with a 1.58-bit representation, a 16.98× compression ratio, 2.3× inference acceleration, 16× storage reduction, 10× memory optimization, and 60% sparsity, while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
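The abstract does not spell out the exact ternarization rule, so the following is only a minimal sketch of how ternary weight quantization with quantization-aware training is commonly implemented: threshold-based ternarization with a per-tensor scale, plus a straight-through estimator so the latent full-precision weights keep receiving gradients. The function names, the threshold_ratio heuristic, and the scaling choice are illustrative assumptions, not TernaryCLIP's confirmed method.

import torch
import torch.nn as nn
import torch.nn.functional as F


def ternarize(weight: torch.Tensor, threshold_ratio: float = 0.05):
    # Threshold-based ternarization: small weights are zeroed, the rest are
    # mapped to +/-1 with a single per-tensor scale. This is a common
    # heuristic, not necessarily TernaryCLIP's exact scheme.
    delta = threshold_ratio * weight.abs().mean()
    ternary = torch.zeros_like(weight)
    ternary[weight > delta] = 1.0
    ternary[weight < -delta] = -1.0
    mask = ternary != 0
    scale = weight[mask].abs().mean() if mask.any() else weight.new_tensor(1.0)
    return ternary, scale


class TernaryLinear(nn.Linear):
    # Linear layer whose forward pass uses ternarized weights while gradients
    # flow to the latent full-precision weights via the straight-through
    # estimator, the standard trick in quantization-aware training.
    def forward(self, x):
        ternary, scale = ternarize(self.weight)
        # Forward sees ternary * scale; backward treats the quantizer as the
        # identity, so self.weight keeps receiving gradients.
        w_q = self.weight + (ternary * scale - self.weight).detach()
        return F.linear(x, w_q, self.bias)

In a setup like this, the linear layers inside CLIP's vision and text encoders would be swapped for TernaryLinear during quantization-aware training, optionally with a distillation loss against the full-precision CLIP outputs, in line with the distillation module mentioned in the abstract.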
Authors (8)
Shu-Hao Zhang
Wei-Cheng Tang
Chen Wu
Peng Hu
Nan Li
Liang-Jie Zhang
+2 more
Submitted
October 23, 2025
Key Contributions
TernaryCLIP is an efficient framework for compressing vision-language models by converting their weights to a ternary format. This significantly reduces model size and inference time while maintaining performance, making large multimodal models more accessible and deployable.
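As an illustration of where a roughly 16× storage reduction can come from, the sketch below packs {-1, 0, +1} weights into 2 bits each (four values per byte) with NumPy; compared with 32-bit floats, 2 bits per weight is a 16× raw saving. The packing layout and helper names are assumptions for illustration, not the released TernaryCLIP storage format.

import numpy as np


def pack_ternary(ternary: np.ndarray) -> np.ndarray:
    # Map {-1, 0, +1} to the 2-bit codes {0, 1, 2} and pack four codes
    # into each byte.
    codes = (ternary.astype(np.int8) + 1).astype(np.uint8)
    pad = (-codes.size) % 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    packed = codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)
    return packed.astype(np.uint8)


def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    # Recover the first n ternary values from the packed bytes.
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    return codes[:n].astype(np.int8) - 1

A quick round-trip check: for w = np.random.choice([-1, 0, 1], size=1000), unpack_ternary(pack_ternary(w), w.size) returns w exactly, at one quarter of the bytes an int8 layout would need.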
Business Value
Enables deployment of powerful vision-language models on resource-constrained devices, reducing operational costs for AI services and allowing for real-time applications.