📄 Abstract
The ability to segment objects based on open-ended language prompts remains a
critical challenge, requiring models to ground textual semantics into precise
spatial masks while handling diverse and unseen categories. We present
OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model
v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings
extracted from a lightweight vision-language model (VLM). Our approach is
guided by four key principles: i) Unified prompting: OpenWorldSAM supports a
diverse range of prompts, including category-level and sentence-level language
descriptions, providing a flexible interface for various segmentation tasks.
ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we
train only 4.5 million parameters on the COCO-stuff dataset, achieving
remarkable resource efficiency. iii) Instance Awareness: We enhance the model's
spatial understanding through novel positional tie-breaker embeddings and
cross-attention layers, enabling effective segmentation of multiple instances.
iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities,
generalizing well on unseen categories and an open vocabulary of concepts
without additional training. Extensive experiments demonstrate that
OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic,
instance, and panoptic segmentation across multiple benchmarks. Code is
available at https://github.com/GinnyXiao/OpenWorldSAM.
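The "positional tie-breaker" idea from principle (iii) can be illustrated with a minimal sketch: a single language embedding is replicated into several instance queries, each offset by a distinct learned embedding so that identical semantics can still resolve to different object instances. This is an illustrative reconstruction, not the paper's actual implementation; the function and variable names below are hypothetical, and the random offsets stand in for embeddings that would be learned during training.

```python
import numpy as np

def make_instance_queries(lang_embedding, num_instances, rng):
    """Replicate one language embedding into multiple instance queries.

    Each copy is shifted by a small 'tie-breaker' offset (learned in the
    real model, random here) so downstream cross-attention can bind each
    query to a different instance of the same category.
    """
    dim = lang_embedding.shape[0]
    # Hypothetical stand-in for learned positional tie-breaker embeddings.
    tie_breakers = rng.normal(scale=0.02, size=(num_instances, dim))
    return lang_embedding[None, :] + tie_breakers

rng = np.random.default_rng(0)
category_embedding = rng.normal(size=(256,))          # e.g., VLM embedding of "dog"
queries = make_instance_queries(category_embedding, 5, rng)
print(queries.shape)  # (5, 256): five distinct queries for one category
```

In the actual framework these queries would attend over SAM2's image features via cross-attention layers, with only the lightweight adapter components (about 4.5M parameters, per the abstract) being trained.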
Authors (6)
Shiting Xiao
Rishabh Kabra
Yuhang Li
Donghyun Lee
Joao Carreira
Priyadarshini Panda
Key Contributions
OpenWorldSAM extends SAM2 to open-vocabulary segmentation with language prompts by integrating multi-modal embeddings from a lightweight VLM. By freezing the pre-trained SAM2 and VLM and training only 4.5 million parameters, it remains highly efficient while supporting flexible category-level and sentence-level prompting across diverse segmentation tasks.
Business Value
Enables more intuitive and flexible image analysis tools, allowing users to segment objects using natural language, which can be applied in creative tools, content management, and assistive technologies.