📄 Abstract
Multimodal Large Language Models (MLLMs) increasingly excel at perception,
understanding, and reasoning. However, current benchmarks inadequately evaluate
their ability to perform these tasks continuously in dynamic, real-world
environments. To bridge this gap, we introduce RTV-Bench, a fine-grained
benchmark for real-time video analysis with MLLMs. RTV-Bench is built on three key
principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve
with scene changes; (2) Hierarchical Question Structure, combining basic and
advanced queries; and (3) Multi-dimensional Evaluation, assessing continuous
perception, understanding, and reasoning abilities. RTV-Bench contains 552
diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated
leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline
(Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5,
InternLM-XComposer2.5-OmniLive) models. Experimental results show that
open-source real-time models largely outperform offline ones but still trail
the top proprietary models. Our analysis also reveals that larger model sizes
or higher frame sampling rates do not significantly boost performance on
RTV-Bench and sometimes cause slight decreases. This underscores the need for
model architectures better optimized for video-stream processing and long sequences to
advance real-time video analysis with MLLMs. Our benchmark toolkit is available
at: https://github.com/LJungang/RTV-Bench.
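To make the MTQA setup concrete, below is a minimal sketch of how a multi-timestamp QA item and its per-timestamp scoring might be represented. All names and fields here (MTQAPair, TimestampedAnswer, evaluate) are illustrative assumptions for this sketch, not the actual RTV-Bench schema or toolkit API.

```python
from dataclasses import dataclass
from typing import Callable, List

# NOTE: all names and fields are hypothetical illustrations of the MTQA
# idea described in the abstract, not the actual RTV-Bench schema.

@dataclass
class TimestampedAnswer:
    """Ground truth valid at one point in the video stream."""
    timestamp_s: float  # when the question is posed, in seconds
    answer: str         # the correct option at that moment

@dataclass
class MTQAPair:
    """A Multi-Timestamp QA item: one question whose answer evolves with the scene."""
    video_id: str
    question: str
    options: List[str]
    level: str  # e.g. "basic" or "advanced" in the hierarchical question structure
    answers: List[TimestampedAnswer]

def evaluate(pair: MTQAPair,
             predict: Callable[[str, float, str, List[str]], str]) -> float:
    """Per-timestamp accuracy for one item.

    `predict` stands in for an MLLM that has consumed the stream up to the
    given timestamp and returns one of the options.
    """
    correct = sum(
        predict(pair.video_id, gt.timestamp_s, pair.question, pair.options) == gt.answer
        for gt in pair.answers
    )
    return correct / len(pair.answers)

if __name__ == "__main__":
    item = MTQAPair(
        video_id="demo_0001",
        question="How many players are currently on the court?",
        options=["2", "4", "6"],
        level="basic",
        answers=[
            TimestampedAnswer(30.0, "2"),
            TimestampedAnswer(120.0, "4"),  # the answer changes as the scene evolves
        ],
    )
    # Trivial baseline "model" that always picks the first option.
    baseline = lambda vid, t, q, opts: opts[0]
    print(f"per-timestamp accuracy: {evaluate(item, baseline):.2f}")
```

The key property this captures is that a single question is scored at multiple timestamps against a time-varying ground truth, so a model must re-perceive the stream rather than answer once and cache the result.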
Authors (14)
Shuhang Xun
Sicheng Tao
Jungang Li
Yibo Shi
Zhixin Lin
Zhanhui Zhu
and 8 more authors
Key Contributions
Introduces RTV-Bench, a fine-grained benchmark for evaluating Multimodal Large Language Models (MLLMs) in continuous real-time video analysis. It addresses the limitations of existing benchmarks by incorporating multi-timestamp QA, hierarchical questions, and multi-dimensional evaluation metrics.
Business Value
Enables more accurate assessment of AI models for video understanding applications, leading to better product development and deployment in areas like autonomous driving and content analysis.