Abstract
Extending CLIP models to semantic segmentation remains challenging due to the
misalignment between their image-level pre-training objectives and the
pixel-level visual understanding required for dense prediction. While prior
efforts have achieved encouraging results by reorganizing the final layer and
features, they often inherit the global alignment bias of preceding layers,
leading to suboptimal segmentation performance. In this work, we propose
LHT-CLIP, a novel training-free framework that systematically exploits the
visual discriminability of CLIP across layer, head, and token levels. Through
comprehensive analysis, we reveal three key insights: (i) the final layers
primarily strengthen image-text alignment at the expense of visual
discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 layers in ViT-L/14),
partly due to the emergence of anomalous tokens; (ii) a subset of attention
heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual
discriminability across datasets; (iii) abnormal tokens display sparse and
consistent activation patterns compared to normal tokens. Based on these
findings, we propose three complementary techniques (semantic-spatial
reweighting, selective head enhancement, and abnormal token replacement) to
effectively restore visual discriminability and improve segmentation
performance without any additional training, auxiliary pre-trained networks, or
extensive hyperparameter tuning. Extensive experiments on 8 common semantic
segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art
performance across diverse scenarios, highlighting its effectiveness and
practicality for real-world deployment.
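The abstract does not spell out the exact procedures, but the token- and head-level ideas can be illustrated with a minimal PyTorch sketch. The sparsity criterion, the threshold, the replace-with-mean strategy, and the head indices below are illustrative assumptions, not the operations or values used in LHT-CLIP.

```python
# Hedged sketch: an illustrative (not the authors') take on two of the three ideas,
# using random tensors in place of real CLIP features.
import torch

def replace_abnormal_tokens(tokens: torch.Tensor, sparsity_thresh: float = 0.9) -> torch.Tensor:
    """Flag tokens whose activation mass is concentrated in very few dimensions
    (a hypothetical 'sparse activation' criterion) and replace them with the mean
    of the remaining, normal tokens."""
    # tokens: (num_tokens, dim) patch features from an intermediate CLIP layer
    mass = tokens.abs() / tokens.abs().sum(dim=-1, keepdim=True).clamp_min(1e-6)
    peakiness = mass.max(dim=-1).values          # high value => few dominant dimensions
    abnormal = peakiness > sparsity_thresh       # boolean mask over tokens
    out = tokens.clone()
    out[abnormal] = tokens[~abnormal].mean(dim=0, keepdim=True)
    return out

def selective_head_enhancement(head_attn: torch.Tensor, good_heads: list[int],
                               boost: float = 2.0) -> torch.Tensor:
    """Upweight the attention maps of heads assumed to be visually discriminative.
    The head indices here are placeholders, not the heads identified in the paper."""
    # head_attn: (num_heads, num_tokens, num_tokens) attention maps of one layer
    weights = torch.ones(head_attn.shape[0], 1, 1)
    weights[good_heads] = boost
    reweighted = head_attn * weights
    return reweighted / reweighted.sum(dim=-1, keepdim=True)  # re-normalize each row

if __name__ == "__main__":
    feats = torch.randn(196, 768)                     # ViT-B/16: 14x14 patch tokens, 768-dim
    attn = torch.rand(12, 197, 197).softmax(dim=-1)   # 12 heads, patches + [CLS] token
    cleaned = replace_abnormal_tokens(feats)
    boosted = selective_head_enhancement(attn, good_heads=[2, 5, 9])
    print(cleaned.shape, boosted.shape)
```

Since both operations act only on tensors already produced by a frozen CLIP forward pass, the sketch stays consistent with the training-free setting described in the abstract.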
Authors (3)
Jinxin Zhou
Jiachen Jiang
Zhihui Zhu
Submitted
October 27, 2025
Key Contributions
LHT-CLIP is a novel training-free framework that systematically exploits CLIP's visual discriminability across layers, heads, and tokens to improve open-vocabulary semantic segmentation. It reveals key insights into how CLIP's internal representations evolve, addressing the misalignment issue for dense prediction tasks.
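The layer-level part of the framework (semantic-spatial reweighting) is only named in the abstract. One plausible reading, sketched below purely as an assumption, is to recover discriminability lost in the final layers by blending dense features from an earlier layer with the final-layer output; the mixing weight and layer choice are placeholders, not values from the paper.

```python
# Hedged sketch (not the paper's method): blend intermediate- and final-layer patch features.
import torch

def blend_layer_features(final_feats: torch.Tensor,
                         mid_feats: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """Convex combination of final-layer and intermediate-layer patch features.
    `alpha` is an illustrative weight, not one reported by the authors."""
    return alpha * mid_feats + (1.0 - alpha) * final_feats

# Dummy ViT-B/16-sized patch features (196 patches, 768-dim)
final_feats = torch.randn(196, 768)
mid_feats = torch.randn(196, 768)    # e.g., the output before the last 3 layers
dense = blend_layer_features(final_feats, mid_feats)
print(dense.shape)  # torch.Size([196, 768])
```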
Business Value
Enables more flexible and accurate image segmentation for applications like autonomous driving, medical image analysis, and content moderation, without requiring task-specific training data.