📄 Abstract
This paper presents a novel training-free framework for open-vocabulary image
segmentation and object recognition (OVSR), which leverages EfficientNetB0, a
convolutional neural network, for unsupervised segmentation and CLIP, a
vision-language model, for open-vocabulary object recognition. The proposed
framework adopts a two-stage pipeline: unsupervised image segmentation followed
by segment-level recognition via vision-language alignment. In the first stage,
pixel-wise features extracted from EfficientNetB0 are decomposed using singular
value decomposition to obtain latent representations, which are then clustered
using hierarchical clustering to segment semantically meaningful regions. The
number of clusters is adaptively determined by the distribution of singular
values. In the second stage, the segmented regions are localized and encoded
into image embeddings using the Vision Transformer backbone of CLIP. Text
embeddings are precomputed using CLIP's text encoder from category-specific
prompts, including a generic "something else" prompt to support open-set
recognition. The image and text embeddings are concatenated and projected into
a shared latent feature space via SVD to enhance cross-modal alignment.
Recognition is performed by computing the softmax over the similarities between
the projected image and text embeddings. The proposed method is evaluated on
standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving
state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and
F1-score. These results demonstrate the effectiveness, flexibility, and
generalizability of the proposed framework.
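The first stage described above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration, not the authors' implementation: random features stand in for EfficientNetB0 activations, and the cluster count is chosen as the number of singular values covering 90% of the spectral energy, which is only one plausible reading of "adaptively determined by the distribution of singular values".

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy pixel-wise feature map standing in for EfficientNetB0 activations
# (shapes are illustrative; the paper's exact layers are not specified here).
rng = np.random.default_rng(0)
H, W, C = 8, 8, 32
features = rng.normal(size=(H, W, C))

# SVD of the (pixels x channels) matrix yields latent pixel representations.
X = features.reshape(H * W, C)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Assumed adaptive rule: keep enough singular values to cover 90% of the
# spectral energy, and reuse that count as the number of clusters.
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.90) + 1)

# Hierarchical (Ward) clustering of the rank-k latent embeddings gives the
# segmentation map: one cluster label per pixel.
latent = U[:, :k] * s[:k]
labels = fcluster(linkage(latent, method="ward"), t=k, criterion="maxclust")
segments = labels.reshape(H, W)
```

On real images the features would come from a forward pass of EfficientNetB0, upsampled to pixel resolution before the SVD step.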
Submitted
October 22, 2025
Key Contributions
This paper introduces a novel training-free framework for open-vocabulary image segmentation and recognition (OVSR) by combining EfficientNetB0 for unsupervised segmentation and CLIP for recognition. The framework uses SVD and hierarchical clustering for segmentation and CLIP's vision-language alignment for recognition, enabling semantic region segmentation and recognition without requiring task-specific training data.
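The recognition stage can likewise be sketched with toy embeddings. Everything below is an assumption-laden stand-in: random vectors replace real CLIP image/text embeddings, and "concatenated and projected into a shared latent space via SVD" is interpreted as stacking the embeddings and projecting onto the top right-singular vectors; the paper's exact construction may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy stand-ins for CLIP embeddings (dimension is illustrative; real CLIP
# ViT embeddings are e.g. 512-d).
rng = np.random.default_rng(1)
d = 64
prompts = ["a photo of a cat", "a photo of a dog", "something else"]
text_emb = rng.normal(size=(len(prompts), d))
segment_emb = text_emb[0] + 0.1 * rng.normal(size=d)  # segment near "cat"

# Assumed joint projection: stack image and text embeddings, center, and
# project onto the top-r right-singular vectors from an SVD of the stack.
stack = np.vstack([segment_emb, text_emb])
centered = stack - stack.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
r = 3
proj = centered @ Vt[:r].T

# Recognition: softmax over cosine similarities between the projected
# segment embedding and each projected text embedding.
img_p, txt_p = proj[0], proj[1:]
sims = (txt_p @ img_p) / (
    np.linalg.norm(txt_p, axis=1) * np.linalg.norm(img_p) + 1e-8
)
probs = softmax(sims)
pred = prompts[int(np.argmax(probs))]
```

Because the segment embedding was constructed close to the "cat" prompt, the softmax should assign it the highest probability; the "something else" prompt plays the open-set role of absorbing segments that match no known category.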
Business Value
Enables flexible and adaptable image analysis systems that can recognize and segment objects based on natural language descriptions, reducing the need for costly data annotation and model retraining for new categories.