
TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge

📄 Abstract

Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into a ternary format instead of full-precision floating-point values. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost, high-efficiency computation. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with a 1.58-bit representation, a 16.98× compression ratio, 2.3× inference acceleration, 16× storage reduction, 10× memory optimization, and 60% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
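
To make the ternarization step concrete, here is a minimal PyTorch-style sketch of absmean ternary quantization with a straight-through estimator for quantization-aware training. The function names and the absmean scheme (as popularized by BitNet b1.58) are assumptions for illustration; the paper's actual quantizer is not reproduced here.

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Quantize weights to {-1, 0, +1} times a per-tensor scale.

    Absmean scaling, assumed here for illustration; TernaryCLIP's
    published quantizer may differ.
    """
    scale = w.abs().mean().clamp(min=eps)      # per-tensor scale
    codes = (w / scale).round().clamp(-1, 1)   # ternary codes in {-1, 0, +1}
    return codes * scale                       # dequantized values for compute

def ternarize_ste(w: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator for quantization-aware training:
    the forward pass sees ternary weights, while gradients flow to the
    latent full-precision weights as if quantization were the identity."""
    return w + (ternarize(w) - w).detach()
```

With weights restricted to three values, a large fraction of entries round to zero (the abstract reports 60% sparsity), and matrix products reduce to additions and subtractions plus a single scale multiply.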
Authors (8)
Shu-Hao Zhang
Wei-Cheng Tang
Chen Wu
Peng Hu
Nan Li
Liang-Jie Zhang
+2 more
Submitted
October 23, 2025
arXiv Category
cs.CV

Key Contributions

TernaryCLIP is an efficient framework for compressing vision-language models by converting their weights to a ternary format. It significantly reduces model size and inference time while maintaining performance, making large multimodal models more accessible and deployable; the sketch below shows where the 1.58-bit figure comes from.
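
The 1.58-bit figure is simply log2(3) ≈ 1.585, the information content of one ternary digit. One way to approach that bound in storage is base-3 packing: five trits fit in one byte (3^5 = 243 ≤ 256), i.e. 1.6 bits per weight. This packing scheme is an illustrative assumption, not necessarily the paper's storage format.

```python
# Five ternary digits fit in one byte, giving 8/5 = 1.6 bits per weight,
# close to the theoretical log2(3) ~= 1.585 bits quoted in the abstract.

def pack5(trits):
    """Pack five values from {-1, 0, +1} into a single byte (base-3)."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in trits:
        byte = byte * 3 + (t + 1)   # shift {-1, 0, 1} to digits {0, 1, 2}
    return byte

def unpack5(byte):
    """Inverse of pack5: recover the five ternary values."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits[::-1]

assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]
```

Against 32-bit floats, ideal ternary packing would give 32 / 1.585 ≈ 20.2× compression; the reported 16.98× is plausibly lower because scales, embeddings, and the roughly 1% of weights left unternarized stay in higher precision (an assumption on our part, not a claim from the paper).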

Business Value

Enables deployment of powerful vision-language models on resource-constrained devices, reducing operational costs for AI services and allowing for real-time applications.