📄 Abstract
Vision-language models (VLMs) achieve remarkable performance through
large-scale image-text pretraining. However, their reliance on labeled image
datasets limits scalability and leaves vast amounts of unlabeled image data
underutilized. To address this, we propose Unified Vision-Language Alignment
for Zero-Label Enhancement (ViZer), an enhancement training framework that
enables zero-label learning in image captioning, providing a practical starting
point for broader zero-label adaptation in vision-language tasks. Unlike prior
approaches that rely on human or synthetically annotated datasets, ViZer
actively aligns vision and language representation features during training,
enabling existing VLMs to generate improved captions without requiring text
labels or full retraining. We demonstrate ViZer's advantage in qualitative
evaluation, as automated caption metrics such as CIDEr and BERTScore often
penalize details that are absent in reference captions. Applying ViZer to
SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements:
the resulting captions are more grounded and descriptive than those of the
baseline models.
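The abstract's point about reference-based metrics can be illustrated with a toy example. The sketch below is not CIDEr or BERTScore; it assumes a much simpler stand-in, clipped unigram precision, and uses made-up captions. The function name `unigram_precision` is hypothetical. It only shows why a caption that adds correct detail can score lower than a terse one when the reference omits that detail.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference.

    A simplified stand-in for reference-based caption metrics (an
    assumption for illustration, not CIDEr or BERTScore). Counts are
    clipped so a repeated token is matched at most as often as it
    appears in the reference.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[token]) for token, count in cand.items())
    return matched / total


# Made-up captions for illustration only.
reference = "a dog on a beach"
short_caption = "a dog on a beach"
detailed_caption = "a brown dog running on a sandy beach at sunset"

short_score = unigram_precision(short_caption, reference)
detailed_score = unigram_precision(detailed_caption, reference)

print(f"short: {short_score:.2f}, detailed: {detailed_score:.2f}")
```

Under this toy metric, the detailed caption scores lower even if every added word is accurate for the image, because the reference never mentions those details.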
Authors (6)
Sanghyun Byun
Jung Ick Guack
Mohanad Odema
Baisub Lee
Jacob Song
Woo Seong Chung
Submitted
October 14, 2025
Key Contributions
ViZer introduces an enhancement training framework for zero-label image captioning that lets existing VLMs generate improved captions without text labels or full retraining. By actively aligning vision and language representations during training, it reduces reliance on labeled image datasets and makes use of otherwise underutilized unlabeled images, offering a practical starting point for broader zero-label adaptation in vision-language tasks.
Business Value
ViZer enables more efficient and scalable development of image captioning systems by leveraging readily available unlabeled image data. This could improve content understanding and generation in applications such as digital asset management and accessibility tools.