Abstract
Cross-modal alignment aims to map heterogeneous modalities into a shared
latent space, as exemplified by models like CLIP, which benefit from
large-scale image-text pretraining for strong recognition capabilities.
However, when operating in resource-constrained settings with limited or
low-quality data, these models often suffer from overconfidence and degraded
performance due to the prevalence of ambiguous or weakly correlated image-text
pairs. Current contrastive learning approaches, which rely on single positive
pairs, further exacerbate this issue by reinforcing overconfidence on uncertain
samples. To address these challenges, we propose Modest-Align, a lightweight
alignment framework designed for robustness and efficiency. Our approach
leverages two complementary strategies -- Random Perturbation, which introduces
controlled noise to simulate uncertainty, and Embedding Smoothing, which
calibrates similarity distributions in the embedding space. These mechanisms
collectively reduce overconfidence and improve performance on noisy or weakly
aligned samples. Extensive experiments across multiple benchmark datasets
demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval
tasks, achieving competitive results with over 100x less training data and 600x
less GPU time than CLIP. Our method offers a practical and scalable solution
for cross-modal alignment in real-world, low-resource scenarios.
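As a rough illustration of the first mechanism, the sketch below (not the authors' code) shows one plausible way Random Perturbation could be realized on CLIP-style unit-normalized embeddings; the noise scale `sigma` is a hypothetical hyperparameter introduced here for illustration.

```python
# Hypothetical sketch of Random Perturbation: inject controlled Gaussian noise
# into the embeddings to simulate uncertainty, then re-normalize so the
# perturbed vectors stay on the unit hypersphere used for cosine similarity.
import torch
import torch.nn.functional as F

def random_perturbation(emb: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    noisy = emb + sigma * torch.randn_like(emb)   # controlled noise injection
    return F.normalize(noisy, dim=-1)             # restore unit norm

# Example: perturb a batch of 8 image embeddings of dimension 512.
img_emb = F.normalize(torch.randn(8, 512), dim=-1)
print(random_perturbation(img_emb).shape)         # torch.Size([8, 512])
```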
Authors (6)
Jiaxiang Liu
Yuan Wang
Jiawei Du
Joey Tianyi Zhou
Mingkun Xu
Zuozhu Liu
Submitted
October 24, 2025
Key Contributions
Modest-Align is a lightweight, data-efficient framework for aligning vision and language modalities, particularly in resource-constrained settings. It addresses overconfidence and degraded performance by employing Random Perturbation to simulate uncertainty and Embedding Smoothing to calibrate similarity distributions, improving robustness without requiring large-scale training data.
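The second mechanism, Embedding Smoothing, is said to calibrate similarity distributions in the embedding space. A minimal sketch of one way this could look, assuming a CLIP-style contrastive objective whose one-hot targets are replaced with softened targets, is shown below; the smoothing weight `alpha` and the temperature are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch: a contrastive loss with smoothed (soft) targets instead
# of single one-hot positives, so ambiguous image-text pairs are not pushed
# toward fully confident predictions.
import torch
import torch.nn.functional as F

def smoothed_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                              temperature: float = 0.07,
                              alpha: float = 0.1) -> torch.Tensor:
    logits = img @ txt.t() / temperature              # pairwise cosine similarities
    n = logits.size(0)
    # Soft targets: weight (1 - alpha) on the paired sample, alpha spread over the rest.
    targets = torch.full((n, n), alpha / (n - 1), device=logits.device)
    targets.fill_diagonal_(1.0 - alpha)
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random unit-normalized embeddings for a batch of 8 pairs.
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(smoothed_contrastive_loss(img, txt))
```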
Business Value
Enables the development of effective vision-language applications even with smaller datasets or in environments with limited data availability, reducing development costs and time.