Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity

📄 Abstract

In this study, we propose an enhancement to the similarity computation mechanism in multi-modal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metric between paired modalities corresponds to the pointwise mutual information (PMI) between the two modalities. However, current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that, overall, our approach outperforms the standard CLIP formulation across several retrieval and classification tasks.
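For orientation, only the following background is standard; the precise KME-CLIP construction is the paper's own and is not reproduced here. The PMI of an image-text pair and the usual CLIP score are

$$
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}, \qquad
s_{\mathrm{CLIP}}(x, y) = \frac{f(x)^{\top} g(y)}{\tau},
$$

where $f$ and $g$ are the (normalized) image and text encoders and $\tau$ is a learned temperature. Schematically, KME-CLIP replaces the finite-dimensional inner product $f(x)^{\top} g(y)$ with an inner product $\langle \phi(x), \psi(y) \rangle_{\mathcal{H}}$ in a reproducing kernel Hilbert space, which the abstract states can approximate $\mathrm{PMI}(x, y)$ to arbitrary accuracy.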
Authors (6)
Naoki Yoshida
Satoshi Hayakawa
Yuhta Takida
Toshimitsu Uesaka
Hiromi Wakaki
Yuki Mitsufuji
Submitted
October 17, 2025
arXiv Category
cs.LG

Key Contributions

This paper proposes KME-CLIP, an enhancement to multi-modal contrastive pretraining frameworks such as CLIP. Building on prior theory showing that the optimal cross-modal similarity is the pointwise mutual information (PMI), it exploits the linear structure of PMI through an inner product in a reproducing kernel Hilbert space, a structure that current CLIP implementations leave unused. The method provably approximates PMI with arbitrary accuracy, and this theoretical refinement yields empirical gains on retrieval and classification tasks (see the sketch below).
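As a purely illustrative sketch of the idea, the snippet below contrasts CLIP's standard scaled cosine similarity with a kernel similarity computed via random Fourier features, which approximate an RBF kernel's RKHS inner product. This is an assumption-laden stand-in, not the paper's KME-CLIP implementation; the function names (`clip_similarity`, `rff_map`, `kernel_similarity`) and the choice of RBF kernel are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_similarity(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP-style logits: cosine similarity scaled by a temperature."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    return img @ txt.T / temperature

def rff_map(x, n_features=512, gamma=1.0, seed=0):
    """Random Fourier features z(x) with z(x) @ z(y) ~= exp(-gamma * ||x - y||^2).

    The shared `seed` matters: image and text embeddings must be mapped with
    the same random projection for the kernel approximation to hold.
    """
    d = x.shape[-1]
    r = np.random.default_rng(seed)
    W = r.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = r.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ W + b)

def kernel_similarity(img_emb, txt_emb, n_features=512, gamma=1.0):
    """Similarity as an approximate RKHS inner product <phi(x), phi(y)>_H."""
    zi = rff_map(img_emb, n_features=n_features, gamma=gamma)
    zt = rff_map(txt_emb, n_features=n_features, gamma=gamma)
    return zi @ zt.T

# Toy batch: 4 image/text pairs in a 64-dim embedding space, with each text
# embedding placed near its paired image embedding.
img = rng.normal(size=(4, 64))
txt = img + 0.1 * rng.normal(size=(4, 64))

print(np.round(clip_similarity(img, txt), 2))    # linear (cosine) similarity
print(np.round(kernel_similarity(img, txt), 2))  # kernelized similarity
```

In both output matrices the diagonal should dominate its row, since matched pairs are closest; the kernel variant simply measures that closeness through a richer, nonlinear feature space rather than a raw inner product.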

Business Value

Improvements to multi-modal similarity computation can yield more accurate cross-modal search engines, better content recommendation systems, and improved image captioning and visual question answering applications.