LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

Abstract

We introduce LongInsightBench, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Information-Dense Videos: We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on duration limits and the information density of both the visual and audio modalities, focusing on content such as lectures, interviews, and vlogs, which contain rich language elements. b) Diverse and Challenging Task Scenarios: We have designed six challenging task scenarios, including both Intra-Event and Inter-Event tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal information loss and processing bias in the multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.
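
The abstract reports results broken down by task (for example T-Loc and CE-Caus) over questions with answer options. As a rough illustration of how such a per-task breakdown can be computed, here is a minimal Python sketch. The record layout (the keys `task`, `answer`, `prediction`) and the function name `per_task_accuracy` are assumptions made for this example; they are not taken from the paper or its released code, and the toy records are invented, not benchmark results.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for multiple-choice records.

    Each record is a dict with hypothetical keys 'task', 'answer',
    and 'prediction' (an option letter such as 'A'). This layout is
    an assumption for illustration, not the benchmark's real schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy, made-up records for illustration only; not results from the paper.
records = [
    {"task": "T-Loc", "answer": "B", "prediction": "b"},
    {"task": "T-Loc", "answer": "C", "prediction": "A"},
    {"task": "CE-Caus", "answer": "D", "prediction": "D"},
]
print(per_task_accuracy(records))
```

Reporting accuracy per task rather than a single aggregate is what makes weaknesses on specific tasks visible, since a strong overall score can hide a weak one.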
Authors (6)
ZhaoYang Han
Qihan Lin
Hao Liang
Bowen Chen
Zhou Liu
Wentao Zhang
Submitted
October 20, 2025
arXiv Category
cs.CV

Key Contributions

Introduces LongInsightBench, which the authors describe as the first benchmark for evaluating omni-modal models on human-centric long-video understanding. It features approximately 1,000 long-duration, information-dense videos (lectures, interviews, vlogs), six challenging task scenarios spanning intra- and inter-event tasks, and a three-step, semi-automated quality assurance pipeline.

Business Value

Enables the development of more sophisticated AI systems for analyzing long-form video content, such as automated content summarization, enhanced search capabilities, and better understanding of human interactions in videos.