📄 Abstract
Multimodal Large Language Models (MLLMs) increasingly excel at perception,
understanding, and reasoning. However, current benchmarks inadequately evaluate
their ability to perform these tasks continuously in dynamic, real-world
environments. To bridge this gap, we introduce RTV-Bench, a fine-grained
benchmark for real-time video analysis with MLLMs. RTV-Bench is built on three key
principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve
with scene changes; (2) Hierarchical Question Structure, combining basic and
advanced queries; and (3) Multi-dimensional Evaluation, assessing continuous
perception, understanding, and reasoning abilities. RTV-Bench contains 552
diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated
leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline
(Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5,
InternLM-XComposer2.5-OmniLive) models. Experimental results show that
open-source real-time models largely outperform offline ones but still trail
the top proprietary models. Our analysis also reveals that larger model sizes
or higher frame sampling rates do not significantly boost performance on
RTV-Bench and sometimes cause slight decreases. This underscores the need for
model architectures better optimized for video-stream processing and long sequences to
advance real-time video analysis with MLLMs. Our benchmark toolkit is available
at: https://github.com/LJungang/RTV-Bench.
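To make the MTQA setup concrete, below is a minimal sketch of how a multi-timestamp QA item and its per-timestamp scoring might be represented. All names and fields here (MTQAPair, TimestampedAnswer, evaluate) are illustrative assumptions for this sketch, not the actual RTV-Bench schema or toolkit API.

```python
from dataclasses import dataclass
from typing import Callable, List

# NOTE: all names and fields are hypothetical illustrations of the MTQA
# idea described in the abstract, not the actual RTV-Bench schema.

@dataclass
class TimestampedAnswer:
    """Ground truth valid at one point in the video stream."""
    timestamp_s: float  # when the question is posed, in seconds
    answer: str         # the correct option at that moment

@dataclass
class MTQAPair:
    """A Multi-Timestamp QA item: one question whose answer evolves with the scene."""
    video_id: str
    question: str
    options: List[str]
    level: str  # e.g. "basic" or "advanced" in the hierarchical question structure
    answers: List[TimestampedAnswer]

def evaluate(pair: MTQAPair,
             predict: Callable[[str, float, str, List[str]], str]) -> float:
    """Per-timestamp accuracy for one item.

    `predict` stands in for an MLLM that has consumed the stream up to the
    given timestamp and returns one of the options.
    """
    correct = sum(
        predict(pair.video_id, gt.timestamp_s, pair.question, pair.options) == gt.answer
        for gt in pair.answers
    )
    return correct / len(pair.answers)

if __name__ == "__main__":
    item = MTQAPair(
        video_id="demo_0001",
        question="How many players are currently on the court?",
        options=["2", "4", "6"],
        level="basic",
        answers=[
            TimestampedAnswer(30.0, "2"),
            TimestampedAnswer(120.0, "4"),  # the answer changes as the scene evolves
        ],
    )
    # Trivial baseline "model" that always picks the first option.
    baseline = lambda vid, t, q, opts: opts[0]
    print(f"per-timestamp accuracy: {evaluate(item, baseline):.2f}")
```

The key property this captures is that a single question is scored at multiple timestamps against a time-varying ground truth, so a model must re-perceive the stream rather than answer once and cache the result.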
Authors (14)
Shuhang Xun
Sicheng Tao
Jungang Li
Yibo Shi
Zhixin Lin
Zhanhui Zhu
and 8 more authors
Key Contributions
Introduces RTV-Bench, a fine-grained benchmark for evaluating Multimodal Large Language Models (MLLMs) in continuous real-time video analysis. It addresses the limitations of existing benchmarks by incorporating multi-timestamp QA, hierarchical questions, and multi-dimensional evaluation metrics.
Business Value
Enables more accurate assessment of AI models for video understanding applications, leading to better product development and deployment in areas like autonomous driving and content analysis.