SAM 2++: Tracking Anything at Any Granularity

Abstract

Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
Authors (10)
Jiaming Zhang
Cheng Liang
Yichun Yang
Chenkai Zeng
Yutao Cui
Xinwen Zhang
+4 more
Submitted
October 21, 2025
arXiv Category
cs.CV

Key Contributions

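SAM 2++ is a unified video tracking model that handles targets at three granularities: masks, boxes, and points. It introduces task-specific prompts that encode each task's input into general prompt embeddings, a unified decoder that brings the different task results into one output form, and a task-adaptive memory mechanism that unifies memory across granularities. The authors also build the Tracking-Any-Granularity dataset, annotated at all three granularities, through a customized data engine. The design targets the limited generalization and the redundancy in model design and parameters that come with single-task trackers.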
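
To make the three granularities concrete, the sketch below shows one target state expressed as a mask, a box, and a point. It is a toy illustration only and not part of SAM 2++: the names `Mask`, `Box`, `mask_to_box`, and `mask_to_point` are hypothetical, and using the mask centroid as the point is an assumption made for the example (the paper's point tracking task need not use centroids).

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = List[List[int]]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with inclusive pixel coordinates."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def _foreground(mask: Mask) -> List[Tuple[int, int]]:
    """Collect (x, y) coordinates of all non-zero pixels."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def mask_to_box(mask: Mask) -> Box:
    """Coarsen a mask to the tightest box that contains its foreground."""
    coords = _foreground(mask)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return Box(min(xs), min(ys), max(xs), max(ys))


def mask_to_point(mask: Mask) -> Tuple[float, float]:
    """Coarsen a mask to a single (x, y) point: the foreground centroid.

    Assumption for illustration; the centroid can fall outside a
    non-convex mask.
    """
    coords = _foreground(mask)
    n = len(coords)
    return (sum(x for x, _ in coords) / n, sum(y for _, y in coords) / n)


# One target, three granularities.
mask: Mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(mask)      # Box(x_min=1, y_min=1, x_max=3, y_max=2)
point = mask_to_point(mask)  # approximately (1.8, 1.6)
```

Each step from mask to box to point discards information, which is one way to see why trackers built for a single granularity tend not to transfer to the others.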

Business Value

Enables more versatile and efficient video analysis systems for uses ranging from security and content moderation to robotics, since a single model covers tracking needs that would otherwise require separate task-specific trackers.