📄 Abstract
Achieving fine-grained spatio-temporal understanding in videos remains a
major challenge for current Video Large Multimodal Models (Video LMMs).
Addressing this challenge requires mastering two core capabilities: video
referring understanding, which captures the semantics of video regions, and
video grounding, which segments object regions based on natural language
descriptions. However, most existing approaches tackle these tasks in
isolation, limiting progress toward unified, referentially grounded video
interaction. We identify a key bottleneck: the lack of high-quality, unified
video instruction data and of a comprehensive benchmark for evaluating
referentially grounded video chat. To close these gaps, we contribute along
three axes: dataset, model, and benchmark. First, we introduce
SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to
enable joint learning of video referring understanding, grounding, and
multi-turn video chat. Second, we propose the SAMA model, which incorporates a
versatile spatio-temporal context aggregator and a Segment Anything Model to
jointly enhance fine-grained video comprehension and precise grounding
capabilities. Finally, we establish SAMA-Bench, a meticulously designed
benchmark consisting of 5,067 questions from 522 videos, to comprehensively
evaluate the integrated capabilities of Video LMMs in multi-turn,
spatio-temporal referring understanding and grounded dialogue. Extensive
experiments show that SAMA not only achieves strong performance on SAMA-Bench
but also sets a new state of the art on general grounding benchmarks, while
remaining highly competitive on standard visual understanding benchmarks.
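To make the unified instruction format concrete, below is a minimal sketch of what a single referring-plus-grounding chat sample could look like. It is purely illustrative: every class and field name is an assumption made for exposition, not the released SAMA-239K schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout for one unified referring + grounding chat sample.
# All names here are illustrative assumptions, NOT the SAMA-239K release format.

@dataclass
class RegionRef:
    turn: int                 # index of the chat turn this region belongs to
    frame_indices: List[int]  # frames where the region is annotated
    boxes: List[List[float]]  # one [x1, y1, x2, y2] box per frame, normalized to [0, 1]
    mask_rle: List[str] = field(default_factory=list)  # optional per-frame RLE masks

@dataclass
class ChatTurn:
    role: str  # "user" or "assistant"
    text: str  # may contain region placeholders such as "<region_0>"

@dataclass
class VideoChatSample:
    video_id: str
    turns: List[ChatTurn]
    regions: List[RegionRef]  # grounds the placeholders used in `turns`

# A referring turn asks about a given region; a grounding turn asks the model
# to localize an object described in free-form language.
sample = VideoChatSample(
    video_id="demo_0001",
    turns=[
        ChatTurn("user", "What is <region_0> doing in this clip?"),
        ChatTurn("assistant", "The person in <region_0> is riding a bicycle."),
        ChatTurn("user", "Segment the bicycle they are riding."),
        ChatTurn("assistant", "<region_1>"),
    ],
    regions=[
        RegionRef(turn=0, frame_indices=[0, 8, 16], boxes=[[0.2, 0.1, 0.5, 0.9]] * 3),
        RegionRef(turn=3, frame_indices=[0, 8, 16], boxes=[[0.3, 0.5, 0.6, 0.95]] * 3),
    ],
)
```

A record of this shape would carry referring understanding (turns 0-1), language-driven grounding (turns 2-3), and multi-turn chat in a single sample, which is the kind of joint supervision the abstract attributes to SAMA-239K.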
Authors (6)
Ye Sun
Hao Zhang
Henghui Ding
Tiehua Zhang
Xingjun Ma
Yu-Gang Jiang
Key Contributions
Addresses fine-grained spatio-temporal understanding in videos for Video LMMs through a unified approach to video referring understanding and grounding in multi-turn video chat. Introduces the large-scale SAMA-239K dataset, the SAMA model, and the SAMA-Bench benchmark for evaluating these capabilities.
Business Value
Enables more intuitive and powerful video-based interaction systems, useful for remote assistance, collaborative editing, and interactive entertainment.