
OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Abstract

The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.
Authors (6)
Shiting Xiao
Rishabh Kabra
Yuhang Li
Donghyun Lee
Joao Carreira
Priyadarshini Panda
Submitted
July 7, 2025
arXiv Category
cs.CV

Key Contributions

OpenWorldSAM extends SAM2 to open-vocabulary segmentation with language prompts by injecting multi-modal embeddings from a lightweight VLM. It freezes the pre-trained SAM2 and VLM components and trains only 4.5 million parameters, while supporting both category-level and sentence-level prompts for diverse segmentation tasks.
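The prompting mechanism described above can be sketched roughly as follows: a language embedding from the (frozen) VLM is replicated into several instance queries, distinguished by positional tie-breaker embeddings, and cross-attended against (frozen) image features to produce prompt embeddings for SAM2's mask decoder. All names, shapes, and dimensions below are illustrative stand-ins, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64          # embedding dim (illustrative)
n_queries = 3   # instance queries per category prompt (illustrative)
n_tokens = 16   # flattened image-feature tokens (illustrative)

# Hypothetical stand-ins for frozen components: a VLM text embedding
# for the prompt (e.g. "zebra") and SAM2 image-encoder features.
lang_emb = rng.standard_normal(d)               # (d,)
img_feats = rng.standard_normal((n_tokens, d))  # (n_tokens, d)

# Positional tie-breaker embeddings: small learned offsets that make
# otherwise identical copies of the language embedding distinct, so
# each query can bind to a different instance of the same category.
tie_breakers = 0.1 * rng.standard_normal((n_queries, d))
queries = lang_emb[None, :] + tie_breakers      # (n_queries, d)

def cross_attention(q, kv):
    """Single-head scaled dot-product cross-attention
    (no learned projections; for illustration only)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])        # (n_q, n_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ kv                             # (n_q, d)

# Ground each query in the image features; in the real model the
# outputs would serve as prompts for SAM2's frozen mask decoder.
prompt_embs = cross_attention(queries, img_feats)
print(prompt_embs.shape)  # (3, 64)
```

Only the tie-breaker embeddings and attention layers would be trained in such a setup, which is how the parameter count stays small relative to the frozen backbones.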

Business Value

Enables more intuitive and flexible image analysis tools, allowing users to segment objects using natural language, which can be applied in creative tools, content management, and assistive technologies.