📄 Abstract
Achieving fine-grained spatio-temporal understanding in videos remains a
major challenge for current Video Large Multimodal Models (Video LMMs).
Addressing this challenge requires mastering two core capabilities: video
referring understanding, which captures the semantics of video regions, and
video grounding, which segments object regions based on natural language
descriptions. However, most existing approaches tackle these tasks in
isolation, limiting progress toward unified, referentially grounded video
interaction. We identify a key bottleneck: the lack of high-quality, unified
video instruction data and of a comprehensive benchmark for evaluating
referentially grounded video chat. To close these gaps, we contribute along
three axes: dataset, model, and benchmark. First, we introduce
SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to
enable joint learning of video referring understanding, grounding, and
multi-turn video chat. Second, we propose the SAMA model, which incorporates a
versatile spatio-temporal context aggregator and a Segment Anything Model to
jointly enhance fine-grained video comprehension and precise grounding
capabilities. Finally, we establish SAMA-Bench, a meticulously designed
benchmark consisting of 5,067 questions from 522 videos, to comprehensively
evaluate the integrated capabilities of Video LMMs in multi-turn,
spatio-temporal referring understanding and grounded dialogue. Extensive
experiments show that SAMA not only achieves strong performance on SAMA-Bench
but also sets a new state of the art on general grounding benchmarks, while
remaining highly competitive on standard visual understanding benchmarks.
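To make the unified instruction format concrete, below is a minimal sketch of what a single referring-plus-grounding chat sample could look like. It is purely illustrative: every class and field name is an assumption made for exposition, not the released SAMA-239K schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout for one unified referring + grounding chat sample.
# All names here are illustrative assumptions, NOT the SAMA-239K release format.

@dataclass
class RegionRef:
    turn: int                 # index of the chat turn this region belongs to
    frame_indices: List[int]  # frames where the region is annotated
    boxes: List[List[float]]  # one [x1, y1, x2, y2] box per frame, normalized to [0, 1]
    mask_rle: List[str] = field(default_factory=list)  # optional per-frame RLE masks

@dataclass
class ChatTurn:
    role: str  # "user" or "assistant"
    text: str  # may contain region placeholders such as "<region_0>"

@dataclass
class VideoChatSample:
    video_id: str
    turns: List[ChatTurn]
    regions: List[RegionRef]  # grounds the placeholders used in `turns`

# A referring turn asks about a given region; a grounding turn asks the model
# to localize an object described in free-form language.
sample = VideoChatSample(
    video_id="demo_0001",
    turns=[
        ChatTurn("user", "What is <region_0> doing in this clip?"),
        ChatTurn("assistant", "The person in <region_0> is riding a bicycle."),
        ChatTurn("user", "Segment the bicycle they are riding."),
        ChatTurn("assistant", "<region_1>"),
    ],
    regions=[
        RegionRef(turn=0, frame_indices=[0, 8, 16], boxes=[[0.2, 0.1, 0.5, 0.9]] * 3),
        RegionRef(turn=3, frame_indices=[0, 8, 16], boxes=[[0.3, 0.5, 0.6, 0.95]] * 3),
    ],
)
```

A record of this shape would carry referring understanding (turns 0-1), language-driven grounding (turns 2-3), and multi-turn chat in a single sample, which is the kind of joint supervision the abstract attributes to SAMA-239K.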
Authors (6)
Ye Sun
Hao Zhang
Henghui Ding
Tiehua Zhang
Xingjun Ma
Yu-Gang Jiang
Key Contributions
Addresses fine-grained spatio-temporal understanding in videos for Video LMMs through a unified approach to video referring understanding and grounding in multi-turn video chat. Introduces the large-scale SAMA-239K dataset, the SAMA model, and the SAMA-Bench benchmark for evaluating these capabilities.
Business Value
Enables more intuitive and powerful video-based interaction systems, useful for remote assistance, collaborative editing, and interactive entertainment.