Abstract
Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into a ternary format, rather than keeping them as full-precision floating-point values. TernaryCLIP incorporates quantization-aware training and distillation modules to prevent precision degradation and enable low-cost, high-efficiency computation. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with a 1.58-bit representation, a 16.98× compression ratio, 2.3× inference acceleration, 16× storage reduction, 10× memory optimization, and 60% sparsity, while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
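The abstract does not spell out the exact ternarization rule, so the following is only a minimal sketch of how ternary weight quantization with quantization-aware training is commonly implemented: threshold-based ternarization with a per-tensor scale, plus a straight-through estimator so the latent full-precision weights keep receiving gradients. The function names, the threshold_ratio heuristic, and the scaling choice are illustrative assumptions, not TernaryCLIP's confirmed method.

import torch
import torch.nn as nn
import torch.nn.functional as F


def ternarize(weight: torch.Tensor, threshold_ratio: float = 0.05):
    # Threshold-based ternarization: small weights are zeroed, the rest are
    # mapped to +/-1 with a single per-tensor scale. This is a common
    # heuristic, not necessarily TernaryCLIP's exact scheme.
    delta = threshold_ratio * weight.abs().mean()
    ternary = torch.zeros_like(weight)
    ternary[weight > delta] = 1.0
    ternary[weight < -delta] = -1.0
    mask = ternary != 0
    scale = weight[mask].abs().mean() if mask.any() else weight.new_tensor(1.0)
    return ternary, scale


class TernaryLinear(nn.Linear):
    # Linear layer whose forward pass uses ternarized weights while gradients
    # flow to the latent full-precision weights via the straight-through
    # estimator, the standard trick in quantization-aware training.
    def forward(self, x):
        ternary, scale = ternarize(self.weight)
        # Forward sees ternary * scale; backward treats the quantizer as the
        # identity, so self.weight keeps receiving gradients.
        w_q = self.weight + (ternary * scale - self.weight).detach()
        return F.linear(x, w_q, self.bias)

In a setup like this, the linear layers inside CLIP's vision and text encoders would be swapped for TernaryLinear during quantization-aware training, optionally with a distillation loss against the full-precision CLIP outputs, in line with the distillation module mentioned in the abstract.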
Authors (8)
Shu-Hao Zhang
Wei-Cheng Tang
Chen Wu
Peng Hu
Nan Li
Liang-Jie Zhang
+2 more
Submitted
October 23, 2025
Key Contributions
TernaryCLIP is an efficient framework for compressing vision-language models by converting their weights to a ternary format. This significantly reduces model size and inference time while maintaining performance, making large multimodal models more accessible and deployable.
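As an illustration of where a roughly 16× storage reduction can come from, the sketch below packs {-1, 0, +1} weights into 2 bits each (four values per byte) with NumPy; compared with 32-bit floats, 2 bits per weight is a 16× raw saving. The packing layout and helper names are assumptions for illustration, not the released TernaryCLIP storage format.

import numpy as np


def pack_ternary(ternary: np.ndarray) -> np.ndarray:
    # Map {-1, 0, +1} to the 2-bit codes {0, 1, 2} and pack four codes
    # into each byte.
    codes = (ternary.astype(np.int8) + 1).astype(np.uint8)
    pad = (-codes.size) % 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    packed = codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)
    return packed.astype(np.uint8)


def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    # Recover the first n ternary values from the packed bytes.
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    return codes[:n].astype(np.int8) - 1

A quick round-trip check: for w = np.random.choice([-1, 0, 1], size=1000), unpack_ternary(pack_ternary(w), w.size) returns w exactly, at one quarter of the bytes an int8 layout would need.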
Business Value
Enables deployment of powerful vision-language models on resource-constrained devices, reducing operational costs for AI services and allowing for real-time applications.