VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Abstract

Vision-Language Models (VLMs) have achieved strong results in video understanding, yet a key question remains: do they truly comprehend visual content or only learn shallow correlations between vision and language? Real visual understanding, especially of physics and common sense, is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data, so high benchmark scores may not reflect real reasoning ability. To address this, we propose negative-control tests using videos that depict physically impossible or logically inconsistent events. We introduce VideoHallu, a synthetic dataset of physics- and commonsense-violating scenes generated with Veo2, Sora, and Kling. It includes expert-annotated question-answer pairs across four categories of violations. Tests of leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1) show that, despite strong results on benchmarks such as MVBench and MMVU, they often miss these violations, exposing gaps in visual reasoning. Reinforcement learning fine-tuning on VideoHallu improves recognition of such violations without reducing standard benchmark performance. Our data is available at https://github.com/zli12321/VideoHallu.git.
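
The abstract describes expert-annotated question-answer pairs over physics- and commonsense-violating synthetic videos. As a rough illustration of how such a negative-control evaluation might be wired up, the sketch below scores a VLM's answers separately per violation category; the JSON field names and the `answer_question` wrapper are hypothetical placeholders, not the dataset's actual schema or the repository's API.

```python
# Minimal sketch of a per-category negative-control evaluation loop.
# Field names ("video", "question", "answer", "category") and the model
# wrapper are assumptions for illustration only.
import json
from pathlib import Path


def load_qa_pairs(path: Path) -> list[dict]:
    """Load annotated QA items; each is assumed to hold a video path,
    a question, a reference answer, and a violation category."""
    with path.open() as f:
        return json.load(f)


def answer_question(video_path: str, question: str) -> str:
    """Placeholder for a VLM call (e.g., Qwen-2.5-VL or Video-R1)."""
    raise NotImplementedError


def evaluate(qa_pairs: list[dict]) -> dict[str, float]:
    """Report accuracy per violation category, so failures on
    physics- or commonsense-violating videos are visible separately."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in qa_pairs:
        category = item["category"]
        prediction = answer_question(item["video"], item["question"])
        total[category] = total.get(category, 0) + 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[category] = correct.get(category, 0) + 1
    return {c: correct.get(c, 0) / total[c] for c in total}
```
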
Authors (9): Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, +3 more
Submitted: May 2, 2025
arXiv Category: cs.CV
Venue: NeurIPS 2025

Key Contributions

This paper introduces VideoHallu, a synthetic dataset of physically impossible and logically inconsistent scenes for evaluating and mitigating multi-modal hallucinations in video understanding. Because current evaluations rely on real-world videos similar to training data and may not reflect true reasoning ability, the paper proposes negative-control tests that expose VLMs' weaknesses despite their strong benchmark scores.
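
On the mitigation side, the abstract reports that reinforcement-learning fine-tuning on VideoHallu improves recognition of violations without hurting standard benchmarks. The snippet below is a minimal sketch of the kind of rule-based correctness-plus-format reward commonly paired with GRPO-style trainers for such fine-tuning; the answer-tag convention, weights, and function names are assumptions for illustration, not the authors' training recipe.

```python
# Illustrative rule-based reward for RL fine-tuning on violation QA.
# The "<answer>...</answer>" convention and the 0.1 format bonus are
# hypothetical choices, not taken from the paper.
import re


def extract_answer(response: str) -> str:
    """Pull the final answer out of a model response."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (match.group(1) if match else response).strip().lower()


def correctness_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer matches the annotated reference
    (e.g., which physics violation occurs), else 0.0."""
    return 1.0 if extract_answer(response) == reference.strip().lower() else 0.0


def format_reward(response: str) -> float:
    """Small bonus for following the expected answer format."""
    return 0.1 if re.search(r"<answer>.*?</answer>", response, re.DOTALL) else 0.0


def total_reward(response: str, reference: str) -> float:
    # Combined score a GRPO-style trainer could maximize per sampled response.
    return correctness_reward(response, reference) + format_reward(response)
```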

Business Value

Improved reliability and trustworthiness of AI systems in video analysis applications, leading to safer deployment in critical areas like autonomous driving and robotics where understanding physical interactions is paramount.