Abstract
Extending CLIP models to semantic segmentation remains challenging due to the
misalignment between their image-level pre-training objectives and the
pixel-level visual understanding required for dense prediction. While prior
efforts have achieved encouraging results by reorganizing the final layer and
features, they often inherit the global alignment bias of preceding layers,
leading to suboptimal segmentation performance. In this work, we propose
LHT-CLIP, a novel training-free framework that systematically exploits the
visual discriminability of CLIP across layer, head, and token levels. Through
comprehensive analysis, we reveal three key insights: (i) the final layers
primarily strengthen image-text alignment at the expense of visual
discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 layers in ViT-L/14),
partly due to the emergence of anomalous tokens; (ii) a subset of attention
heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual
discriminability across datasets; (iii) abnormal tokens display sparse and
consistent activation patterns compared to normal tokens. Based on these
findings, we propose three complementary techniques (semantic-spatial
reweighting, selective head enhancement, and abnormal token replacement) to
effectively restore visual discriminability and improve segmentation
performance without any additional training, auxiliary pre-trained networks, or
extensive hyperparameter tuning. Extensive experiments on 8 common semantic
segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art
performance across diverse scenarios, highlighting its effectiveness and
practicality for real-world deployment.
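The abstract does not spell out the exact procedures, but the token- and head-level ideas can be illustrated with a minimal PyTorch sketch. The sparsity criterion, the threshold, the replace-with-mean strategy, and the head indices below are illustrative assumptions, not the operations or values used in LHT-CLIP.

```python
# Hedged sketch: an illustrative (not the authors') take on two of the three ideas,
# using random tensors in place of real CLIP features.
import torch

def replace_abnormal_tokens(tokens: torch.Tensor, sparsity_thresh: float = 0.9) -> torch.Tensor:
    """Flag tokens whose activation mass is concentrated in very few dimensions
    (a hypothetical 'sparse activation' criterion) and replace them with the mean
    of the remaining, normal tokens."""
    # tokens: (num_tokens, dim) patch features from an intermediate CLIP layer
    mass = tokens.abs() / tokens.abs().sum(dim=-1, keepdim=True).clamp_min(1e-6)
    peakiness = mass.max(dim=-1).values          # high value => few dominant dimensions
    abnormal = peakiness > sparsity_thresh       # boolean mask over tokens
    out = tokens.clone()
    out[abnormal] = tokens[~abnormal].mean(dim=0, keepdim=True)
    return out

def selective_head_enhancement(head_attn: torch.Tensor, good_heads: list[int],
                               boost: float = 2.0) -> torch.Tensor:
    """Upweight the attention maps of heads assumed to be visually discriminative.
    The head indices here are placeholders, not the heads identified in the paper."""
    # head_attn: (num_heads, num_tokens, num_tokens) attention maps of one layer
    weights = torch.ones(head_attn.shape[0], 1, 1)
    weights[good_heads] = boost
    reweighted = head_attn * weights
    return reweighted / reweighted.sum(dim=-1, keepdim=True)  # re-normalize each row

if __name__ == "__main__":
    feats = torch.randn(196, 768)                     # ViT-B/16: 14x14 patch tokens, 768-dim
    attn = torch.rand(12, 197, 197).softmax(dim=-1)   # 12 heads, patches + [CLS] token
    cleaned = replace_abnormal_tokens(feats)
    boosted = selective_head_enhancement(attn, good_heads=[2, 5, 9])
    print(cleaned.shape, boosted.shape)
```

Since both operations act only on tensors already produced by a frozen CLIP forward pass, the sketch stays consistent with the training-free setting described in the abstract.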
Authors (3)
Jinxin Zhou
Jiachen Jiang
Zhihui Zhu
Submitted
October 27, 2025
Key Contributions
LHT-CLIP is a novel training-free framework that systematically exploits CLIP's visual discriminability across layers, heads, and tokens to improve open-vocabulary semantic segmentation. It reveals key insights into how CLIP's internal representations evolve, addressing the misalignment issue for dense prediction tasks.
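The layer-level part of the framework (semantic-spatial reweighting) is only named in the abstract. One plausible reading, sketched below purely as an assumption, is to recover discriminability lost in the final layers by blending dense features from an earlier layer with the final-layer output; the mixing weight and layer choice are placeholders, not values from the paper.

```python
# Hedged sketch (not the paper's method): blend intermediate- and final-layer patch features.
import torch

def blend_layer_features(final_feats: torch.Tensor,
                         mid_feats: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """Convex combination of final-layer and intermediate-layer patch features.
    `alpha` is an illustrative weight, not one reported by the authors."""
    return alpha * mid_feats + (1.0 - alpha) * final_feats

# Dummy ViT-B/16-sized patch features (196 patches, 768-dim)
final_feats = torch.randn(196, 768)
mid_feats = torch.randn(196, 768)    # e.g., the output before the last 3 layers
dense = blend_layer_features(final_feats, mid_feats)
print(dense.shape)  # torch.Size([196, 768])
```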
Business Value
Enables more flexible and accurate image segmentation for applications like autonomous driving, medical image analysis, and content moderation, without requiring task-specific training data.