Abstract
Vision-language models such as CLIP have recently propelled open-vocabulary
dense prediction tasks by enabling recognition of a broad range of visual
concepts. However, CLIP still struggles with fine-grained, region-level
understanding, hindering its effectiveness on these dense prediction tasks. We
identify two pivotal factors required to address this limitation: semantic
coherence and fine-grained vision-language alignment. Current adaptation
methods often improve fine-grained alignment at the expense of semantic
coherence, and often rely on extra modules or supervised fine-tuning. To
overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel
approach that simultaneously enhances semantic coherence and fine-grained
alignment by leveraging a model's own knowledge across all representation
levels. Unlike prior methods, ATAS uses only unlabeled images and an internal
self-distillation process to refine the representations of CLIP vision encoders,
preserving local semantic consistency while sharpening local detail
recognition. On open-vocabulary object detection and semantic segmentation
benchmarks, ATAS achieves substantial performance gains, outperforming baseline
CLIP models. These results validate the effectiveness of our approach and
underscore the importance of jointly maintaining semantic coherence and
fine-grained alignment for advanced open-vocabulary dense prediction.
Key Contributions
This paper proposes ATAS (Any-to-Any Self-Distillation), a novel approach that enhances both semantic coherence and fine-grained alignment in vision-language models like CLIP for open-vocabulary dense prediction tasks. ATAS uses only unlabeled images and an internal self-distillation process across all representation levels, avoiding extra modules or supervised fine-tuning.
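The page does not include the algorithmic details of ATAS, so the following is only a minimal sketch of the general idea it describes: self-distillation on unlabeled images, where representations from different levels of the same vision encoder supervise one another, with no labels and no extra modules. Everything here is an assumption for illustration; TinyEncoder, any_to_any_distillation_loss, and the pairwise cosine objective are hypothetical stand-ins, not the authors' actual implementation or the real CLIP encoder.

```python
# Illustrative sketch only (not the paper's released code).
# A trainable "student" copy of a vision encoder is aligned, level by level,
# against a frozen "teacher" copy of itself on unlabeled inputs.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a CLIP-style vision encoder that exposes every block's output."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # keep the representation at every level
        return feats


def any_to_any_distillation_loss(student_feats, teacher_feats):
    """Cosine-alignment loss over every (student level, teacher level) pair -- an assumed objective."""
    loss, pairs = 0.0, 0
    for s in student_feats:
        for t in teacher_feats:
            t = t.detach()  # teacher only provides targets (self-distillation)
            loss = loss + (1.0 - F.cosine_similarity(s, t, dim=-1).mean())
            pairs += 1
    return loss / pairs


if __name__ == "__main__":
    student = TinyEncoder()
    teacher = TinyEncoder()
    teacher.load_state_dict(student.state_dict())  # teacher starts as a frozen copy
    for p in teacher.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

    # Random tensors stand in for patch tokens of unlabeled images: (batch, tokens, dim).
    x = torch.randn(8, 16, 64)

    loss = any_to_any_distillation_loss(student(x), teacher(x))
    loss.backward()
    optimizer.step()
    print(f"toy distillation loss: {loss.item():.4f}")
```

The point of the sketch is the training signal: it requires only unlabeled images and the encoder's own multi-level features, which matches the paper's stated constraint of no extra modules and no supervised fine-tuning. How ATAS actually pairs levels and weights the objective is specified in the paper itself, not here.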
Business Value
Enables AI systems to understand and label objects and regions in images with greater accuracy and flexibility, even for concepts not explicitly seen during training, which is valuable for content moderation, image search, and robotics.