Abstract: As large-scale foundation models trained on billions of image--mask pairs
covering a vast diversity of scenes, objects, and contexts, SAM and its
upgraded version, SAM~2, have significantly influenced multiple fields within
computer vision. Leveraging such unprecedented data diversity, they exhibit
strong open-world segmentation capabilities, with SAM~2 further enhancing these
capabilities to support high-quality video segmentation. While SAMs (SAM and
SAM~2) have demonstrated excellent performance in segmenting
context-independent concepts like people, cars, and roads, they overlook more
challenging context-dependent (CD) concepts, such as visual saliency,
camouflage, industrial defects, and medical lesions. CD concepts rely heavily
on global and local contextual information, making them sensitive to contextual
shifts and thus demanding strong discriminative capabilities from
the model. The lack of a comprehensive evaluation of SAMs limits understanding of
their performance boundaries, which may hinder the design of future models. In
this paper, we conduct a thorough evaluation of SAMs on 11 CD concepts across
2D and 3D images and videos in various visual modalities within natural,
medical, and industrial scenes. We develop a unified evaluation framework for
SAM and SAM~2 that supports manual, automatic, and intermediate self-prompting,
aided by our specific prompt generation and interaction strategies. We further
explore the potential of SAM~2 for in-context learning and introduce prompt
robustness testing to simulate real-world imperfect prompts. Finally, we
analyze the benefits and limitations of SAMs in understanding CD concepts and
discuss their future development in segmentation tasks.
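To make the prompt-robustness idea concrete, below is a minimal sketch of how one might perturb a ground-truth-derived point prompt and measure the resulting IoU drop using the official segment_anything package. The helper names (jitter_point, robustness_test), the offset magnitude, the trial count, and the checkpoint path are illustrative assumptions for this sketch, not the paper's actual evaluation protocol.

    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor  # official SAM package

    def iou(pred, gt):
        """Intersection-over-union between two boolean masks."""
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return inter / max(union, 1)

    def jitter_point(point, max_offset, rng):
        """Shift a click by a random pixel offset to mimic an imperfect prompt."""
        return point + rng.integers(-max_offset, max_offset + 1, size=2)

    def robustness_test(predictor, image, gt_mask, max_offset=20, trials=10, seed=0):
        """Compare IoU from a clean centroid click vs. randomly jittered clicks.

        `image` is an HxWx3 uint8 RGB array; `gt_mask` is an HxW boolean array.
        The offset magnitude and trial count here are illustrative settings.
        """
        predictor.set_image(image)
        ys, xs = np.nonzero(gt_mask)
        centroid = np.array([xs.mean(), ys.mean()])  # SAM expects (x, y) pixel coords
        h, w = gt_mask.shape

        def predict(point):
            masks, _, _ = predictor.predict(
                point_coords=point[None, :].astype(np.float32),
                point_labels=np.array([1]),        # 1 = foreground click
                multimask_output=False,
            )
            return masks[0]

        rng = np.random.default_rng(seed)
        clean_iou = iou(predict(centroid), gt_mask)
        noisy_ious = []
        for _ in range(trials):
            pt = np.clip(jitter_point(centroid, max_offset, rng), 0, [w - 1, h - 1])
            noisy_ious.append(iou(predict(pt), gt_mask))
        return clean_iou, float(np.mean(noisy_ious))

    # Usage (checkpoint path is a placeholder):
    # sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    # clean, noisy = robustness_test(SamPredictor(sam), image, gt_mask)

The gap between the clean and jittered scores gives one simple proxy for how sensitive a promptable segmenter is to imperfect clicks; the paper's framework covers a broader set of prompt types and interaction strategies than this single-point example.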