
Unifying Vision-Language Latents for Zero-label Image Caption Enhancement

Abstract

Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human- or synthetically-annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent from reference captions. Applying ViZer to SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baselines.
Authors (6)
Sanghyun Byun
Jung Ick Guack
Mohanad Odema
Baisub Lee
Jacob Song
Woo Seong Chung
Submitted
October 14, 2025
arXiv Category
cs.CV

Key Contributions

ViZer introduces an enhancement training framework for zero-label image captioning that lets existing VLMs generate improved captions without text labels or full retraining. By actively aligning vision and language representations during training, it reduces dependence on labeled datasets, puts otherwise underutilized unlabeled images to work, and offers a practical starting point for broader zero-label adaptation in vision-language tasks.
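The core idea of aligning vision and language representations can be sketched as a simple training objective. Note this is an illustrative assumption, not ViZer's actual loss (which this summary does not specify): the sketch below penalizes cosine distance between paired image and caption feature vectors, which is a common way to pull the two modalities together in a shared latent space.

```python
import numpy as np

def alignment_loss(vision_feats: np.ndarray, text_feats: np.ndarray) -> float:
    """Mean cosine distance between paired vision and language features.

    Hypothetical illustration of vision-language alignment; the exact
    ViZer objective is not given in this summary. Inputs are (N, D)
    arrays of N paired feature vectors of dimension D.
    """
    # L2-normalize each feature vector so the dot product is cosine similarity.
    v = vision_feats / np.linalg.norm(vision_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    cos_sim = np.sum(v * t, axis=1)       # per-pair cosine similarity in [-1, 1]
    return float(np.mean(1.0 - cos_sim))  # 0 when every pair is perfectly aligned

# Identical feature pairs incur zero loss; orthogonal pairs incur loss 1.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])
print(alignment_loss(a, a))  # 0.0
print(alignment_loss(a, b))  # 1.0
```

Minimizing such a loss over unlabeled images (with captions generated by the model itself) is one plausible way a zero-label scheme could tighten the coupling between the two modalities without ground-truth text.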

Business Value

Enables more efficient and scalable development of image captioning systems by leveraging readily available unlabeled image data. This can lead to improved content understanding and generation for various applications like digital asset management and accessibility tools.