📄 Abstract
The ability to segment objects based on open-ended language prompts remains a
critical challenge, requiring models to ground textual semantics into precise
spatial masks while handling diverse and unseen categories. We present
OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model
v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings
extracted from a lightweight vision-language model (VLM). Our approach is
guided by four key principles: i) Unified prompting: OpenWorldSAM supports a
diverse range of prompts, including category-level and sentence-level language
descriptions, providing a flexible interface for various segmentation tasks.
ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we
train only 4.5 million parameters on the COCO-stuff dataset, achieving
remarkable resource efficiency. iii) Instance Awareness: We enhance the model's
spatial understanding through novel positional tie-breaker embeddings and
cross-attention layers, enabling effective segmentation of multiple instances.
iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities,
generalizing well on unseen categories and an open vocabulary of concepts
without additional training. Extensive experiments demonstrate that
OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic,
instance, and panoptic segmentation across multiple benchmarks. Code is
available at https://github.com/GinnyXiao/OpenWorldSAM.
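The "positional tie-breaker" idea from principle (iii) can be illustrated with a minimal sketch: a single language embedding is replicated into several instance queries, each offset by a distinct learned embedding so that identical semantics can still resolve to different object instances. This is an illustrative reconstruction, not the paper's actual implementation; the function and variable names below are hypothetical, and the random offsets stand in for embeddings that would be learned during training.

```python
import numpy as np

def make_instance_queries(lang_embedding, num_instances, rng):
    """Replicate one language embedding into multiple instance queries.

    Each copy is shifted by a small 'tie-breaker' offset (learned in the
    real model, random here) so downstream cross-attention can bind each
    query to a different instance of the same category.
    """
    dim = lang_embedding.shape[0]
    # Hypothetical stand-in for learned positional tie-breaker embeddings.
    tie_breakers = rng.normal(scale=0.02, size=(num_instances, dim))
    return lang_embedding[None, :] + tie_breakers

rng = np.random.default_rng(0)
category_embedding = rng.normal(size=(256,))          # e.g., VLM embedding of "dog"
queries = make_instance_queries(category_embedding, 5, rng)
print(queries.shape)  # (5, 256): five distinct queries for one category
```

In the actual framework these queries would attend over SAM2's image features via cross-attention layers, with only the lightweight adapter components (about 4.5M parameters, per the abstract) being trained.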
Authors (6)
Shiting Xiao
Rishabh Kabra
Yuhang Li
Donghyun Lee
Joao Carreira
Priyadarshini Panda
Key Contributions
OpenWorldSAM extends SAM2 to open-vocabulary segmentation with language prompts by integrating multi-modal embeddings from a lightweight VLM. By freezing the pre-trained SAM2 and VLM and training only 4.5 million parameters, it remains highly efficient while supporting flexible category-level and sentence-level prompting across diverse segmentation tasks.
Business Value
Enables more intuitive and flexible image analysis tools, allowing users to segment objects using natural language, which can be applied in creative tools, content management, and assistive technologies.