
MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

Abstract

The rapid advancement of large vision-language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee effective utilization of that context, posing a critical challenge for real-world applications. Current evaluations of long-context faithfulness focus predominantly on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
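The benchmark's layout (8 tasks, 6 context-length intervals, and text/image/video modalities, each scored for faithfulness) suggests a natural shape for evaluation records. Below is a minimal, purely illustrative Python sketch of such a record and a per-interval faithfulness report; the field names, the F1-style citation score, and the bucket boundaries are assumptions for illustration, not MMLongCite's actual schema or metric.

```python
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean

# Hypothetical record for one benchmark example; field names are illustrative,
# not MMLongCite's actual schema.
@dataclass
class Example:
    task: str                 # one of the 8 tasks
    modality: str             # "text", "image", or "video"
    context_tokens: int       # total multimodal context length
    gold_evidence: set[int]   # indices of chunks that support the answer
    cited_evidence: set[int]  # chunks the model actually cited

def faithfulness(ex: Example) -> float:
    """Toy citation-fidelity score: F1 between cited and gold evidence."""
    if not ex.cited_evidence and not ex.gold_evidence:
        return 1.0
    tp = len(ex.cited_evidence & ex.gold_evidence)
    prec = tp / len(ex.cited_evidence) if ex.cited_evidence else 0.0
    rec = tp / len(ex.gold_evidence) if ex.gold_evidence else 0.0
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

# Assumed context-length buckets; the paper's 6 intervals are not specified here.
BUCKETS = [8_000, 16_000, 32_000, 64_000, 128_000]

def bucket(tokens: int) -> str:
    for b in BUCKETS:
        if tokens <= b:
            return f"<={b}"
    return f">{BUCKETS[-1]}"

def report(examples: list[Example]) -> dict[str, float]:
    """Mean faithfulness per context-length interval."""
    groups: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        groups[bucket(ex.context_tokens)].append(faithfulness(ex))
    return {k: mean(v) for k, v in sorted(groups.items())}
```

Grouping scores by length interval makes it easy to see whether fidelity degrades as the context grows, which is one of the two axes the paper analyzes.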
Authors (11)
Keyan Zhou
Zecheng Tang
Lingfeng Ming
Guanghao Zhou
Qiguang Chen
Dan Qiao
+5 more
Submitted: October 15, 2025
arXiv Category: cs.CV
arXiv PDF

Key Contributions

MMLongCite introduces a comprehensive benchmark specifically designed to evaluate the fidelity of large vision-language models (LVLMs) in long-context scenarios, addressing a gap in current evaluations, which are either text-only or limited to short multimodal contexts. The benchmark spans diverse tasks, context lengths, and modalities (text, images, and videos), enabling a deeper analysis of how context length and the position of crucial content affect model faithfulness.
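As a companion to the per-length report above, the position analysis can be sketched the same way: bin examples by where the crucial content falls in the context and average a per-example score within each bin. Again, `position_frac` and the scores below are hypothetical inputs for illustration, not the paper's data.

```python
from collections import defaultdict
from statistics import mean

def by_position(records: list[tuple[float, float]], n_bins: int = 5) -> dict[str, float]:
    """records: (position_frac in [0, 1], faithfulness score) pairs.

    Bins examples by the relative position of the crucial content and
    returns the mean score per bin, e.g. to check for a mid-context dip.
    """
    bins: dict[int, list[float]] = defaultdict(list)
    for pos, score in records:
        bins[min(int(pos * n_bins), n_bins - 1)].append(score)
    return {f"{b / n_bins:.0%}-{(b + 1) / n_bins:.0%}": mean(s)
            for b, s in sorted(bins.items())}

# Toy usage with made-up (position, score) pairs:
print(by_position([(0.05, 0.9), (0.5, 0.6), (0.55, 0.55), (0.95, 0.8)]))
```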

Business Value

Provides essential tooling for developers and researchers to rigorously assess and improve the reliability of long-context multimodal AI systems, which is crucial for applications that must understand extensive visual and textual information.