📄 Abstract
We introduce LongInsightBench, the first benchmark designed to
assess models' ability to understand long videos, with a focus on human
language, viewpoints, actions, and other contextual elements, while integrating
visual, audio, and text modalities. Our benchmark excels in three key
areas: a) Long-Duration, Information-Dense Videos: We carefully select
approximately 1,000 videos from the open-source dataset FineVideo based on
a duration limit and the information density of both visual and audio modalities,
focusing on content like lectures, interviews, and vlogs, which contain rich
language elements. b) Diverse and Challenging Task Scenarios: We have
designed six challenging task scenarios, including both Intra-Event and
Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance
Pipelines: We have developed a three-step, semi-automated data quality
assurance pipeline to ensure the difficulty and validity of the synthesized
questions and answer options. Based on LongInsightBench, we designed a series
of experiments. Experimental results show that omni-modal models (OLMs) still
face challenges in tasks requiring precise temporal localization (T-Loc) and
long-range causal inference (CE-Caus). Extended experiments reveal
information loss and processing bias in the multi-modal fusion of OLMs. Our dataset
and code are available at
https://anonymous.4open.science/r/LongInsightBench-910F/.
Authors (6)
ZhaoYang Han
Qihan Lin
Hao Liang
Bowen Chen
Zhou Liu
Wentao Zhang
Submitted
October 20, 2025
Key Contributions
Introduces LongInsightBench, the first comprehensive benchmark specifically designed for evaluating omni-modal models on human-centric long-video understanding. It features long-duration, information-dense videos (lectures, interviews, vlogs), diverse and challenging task scenarios (intra- and inter-event), and a rigorous quality assurance pipeline, addressing the need for better evaluation in this domain.
Business Value
Enables the development of more sophisticated AI systems for analyzing long-form video content, such as automated content summarization, enhanced search capabilities, and better understanding of human interactions in videos.