Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

📄 Abstract

Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the expense of visual discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) displays consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement, which restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
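
The abstract only names the three techniques; the sketch below is an illustrative reading of two of them (abnormal token replacement and selective head enhancement), not the authors' implementation. The function names, the norm-based outlier rule, and the Fisher-style head score are hypothetical stand-ins for the paper's actual criteria, assuming per-head pooled features and labelled samples are available.

```python
# Illustrative sketch only: the detection rule and head-scoring criterion
# below are assumptions, not LHT-CLIP's published procedure.
import numpy as np

def replace_abnormal_tokens(tokens: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """Replace tokens whose activation norm is a strong outlier with the
    mean of the remaining (normal) tokens.

    tokens: (num_tokens, dim) array of ViT patch-token features.
    """
    norms = np.linalg.norm(tokens, axis=1)
    z = (norms - norms.mean()) / (norms.std() + 1e-6)
    abnormal = z > z_thresh                      # sparse, high-norm outliers
    if abnormal.any() and (~abnormal).any():
        normal_mean = tokens[~abnormal].mean(axis=0)
        tokens = tokens.copy()
        tokens[abnormal] = normal_mean           # crude replacement rule
    return tokens

def select_discriminative_heads(head_feats: np.ndarray, labels: np.ndarray, k: int = 10) -> np.ndarray:
    """Rank attention heads by a simple Fisher-style score (between-class /
    within-class variance of their pooled features) and keep the top-k.

    head_feats: (num_heads, num_samples, dim); labels: (num_samples,).
    """
    scores = []
    for h in range(head_feats.shape[0]):
        feats = head_feats[h]
        overall = feats.mean(axis=0)
        between, within = 0.0, 0.0
        for c in np.unique(labels):
            cls = feats[labels == c]
            between += len(cls) * np.sum((cls.mean(axis=0) - overall) ** 2)
            within += np.sum((cls - cls.mean(axis=0)) ** 2)
        scores.append(between / (within + 1e-6))
    return np.argsort(scores)[::-1][:k]          # indices of the top-k heads
```

In this reading, only the top-scoring heads (e.g., 10 of 144 in ViT-B/16, per the abstract) would contribute to the dense features used for segmentation, while outlier tokens are smoothed before matching against text embeddings.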
Authors (3)
Jinxin Zhou
Jiachen Jiang
Zhihui Zhu
Submitted
October 27, 2025
arXiv Category
cs.CV

Key Contributions

LHT-CLIP is a training-free framework that systematically exploits CLIP's visual discriminability across layers, heads, and tokens to improve open-vocabulary semantic segmentation. It shows how CLIP's later layers trade visual discriminability for image-text alignment, and uses this insight to bridge the gap between image-level pre-training and dense, pixel-level prediction.

Business Value

Enables more flexible and accurate image segmentation for applications like autonomous driving, medical image analysis, and content moderation, without requiring task-specific training data.