Abstract
Vision-Language Models (VLMs) have achieved strong results in video
understanding, yet a key question remains: do they truly comprehend visual
content or only learn shallow correlations between vision and language? Real
visual understanding, especially of physics and common sense, is essential for
AI systems that interact with the physical world. Current evaluations mostly
use real-world videos similar to training data, so high benchmark scores may
not reflect real reasoning ability. To address this, we propose
negative-control tests using videos that depict physically impossible or
logically inconsistent events. We introduce VideoHallu, a synthetic dataset of
physics- and commonsense-violating scenes generated with Veo2, Sora, and Kling.
It includes expert-annotated question-answer pairs across four categories of
violations. Tests of leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1) show
that, despite strong results on benchmarks such as MVBench and MMVU, they often
miss these violations, exposing gaps in visual reasoning. Reinforcement
learning fine-tuning on VideoHallu improves recognition of such violations
without reducing standard benchmark performance. Our data is available at
https://github.com/zli12321/VideoHallu.git.
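The evaluation the abstract describes boils down to showing a model a clip of an impossible event and checking whether its answer flags the violation. Below is a minimal sketch of such a probe, assuming the publicly documented Hugging Face interface for Qwen2.5-VL and its qwen_vl_utils helper; the clip path and question are hypothetical placeholders, not items drawn from VideoHallu.

# Probe a VLM with a negative-control (physics-violating) video question.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clips/ball_falls_upward.mp4"},  # hypothetical clip
        {"type": "text",
         "text": "Does anything in this video violate the laws of physics? Explain."},
    ],
}]

# Standard Qwen2.5-VL preprocessing: chat template plus vision feature extraction.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and keep only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)

What distinguishes this from ordinary video QA is the scoring: the dataset's expert-annotated answers describe the specific violation, so a response that narrates the scene plausibly but never notices the impossibility counts as a hallucination.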
Authors (9)
Zongxia Li
Xiyang Wu
Guangyao Shi
Yubin Qin
Hongyang Du
Fuxiao Liu
(and 3 additional authors)
NeurIPS 2025
Key Contributions
This paper introduces VideoHallu, a synthetic dataset of physically impossible and logically inconsistent scenes built to evaluate and mitigate multi-modal hallucinations in video understanding. Because existing evaluations rely on real-world videos similar to training data, high benchmark scores can mask shallow vision-language correlations; the proposed negative-control tests expose reasoning failures that those scores conceal.
Business Value
Improved reliability and trustworthiness of AI systems for video analysis, enabling safer deployment in safety-critical domains such as autonomous driving and robotics, where understanding physical interactions is paramount.