Abstract
The rapid advancement of large vision-language models (LVLMs) has led to a
significant expansion of their context windows. However, an extended context
window does not guarantee the effective utilization of the context, posing a
critical challenge for real-world applications. Current evaluations of such
long-context faithfulness are predominantly focused on the text-only domain,
while multimodal assessments remain limited to short contexts. To bridge this
gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate
the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8
distinct tasks spanning 6 context length intervals and incorporates diverse
modalities, including text, images, and videos. Our evaluation of
state-of-the-art LVLMs reveals their limited faithfulness in handling long
multimodal contexts. Furthermore, we provide an in-depth analysis of how
context length and the position of crucial content affect the faithfulness of
these models.
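To make the two evaluation axes mentioned above concrete, the sketch below illustrates how faithfulness could be scored over a grid of context lengths and evidence positions (the "depth" of the crucial content within the context). This is a minimal, hypothetical illustration and not the authors' harness: the length buckets, depth values, and the `ask_model` callable are all assumptions, and a real run would use MMLongCite's multimodal inputs rather than padded text.

```python
# Hypothetical sketch of a long-context faithfulness grid: vary the total context
# length and the relative position of the crucial evidence, then check whether the
# model's answer is grounded in that evidence. NOT the MMLongCite implementation.
from typing import Callable, Dict, List, Tuple

LENGTH_BUCKETS: List[int] = [8_000, 16_000, 32_000, 64_000, 128_000]  # assumed context sizes (chars)
NEEDLE_DEPTHS: List[float] = [0.0, 0.25, 0.5, 0.75, 1.0]              # relative position of the evidence

def build_context(filler: str, evidence: str, target_chars: int, depth: float) -> str:
    """Pad filler text to roughly target_chars and insert the evidence at the given depth."""
    body = (filler * (target_chars // max(len(filler), 1) + 1))[:target_chars]
    cut = int(len(body) * depth)
    return body[:cut] + "\n" + evidence + "\n" + body[cut:]

def evaluate(ask_model: Callable[[str, str], str],
             question: str, evidence: str, answer_key: str,
             filler: str) -> Dict[Tuple[int, float], bool]:
    """Return a faithfulness flag for every (context length, evidence depth) cell."""
    results: Dict[Tuple[int, float], bool] = {}
    for length in LENGTH_BUCKETS:
        for depth in NEEDLE_DEPTHS:
            context = build_context(filler, evidence, length, depth)
            answer = ask_model(context, question)   # ask_model wraps any LVLM API (assumed)
            results[(length, depth)] = answer_key.lower() in answer.lower()
    return results
```

Averaging such flags per cell yields the kind of length-by-position analysis the abstract refers to, revealing where in the context a model stops using the evidence it was given.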
Authors (11)
Keyan Zhou
Zecheng Tang
Lingfeng Ming
Guanghao Zhou
Qiguang Chen
Dan Qiao
+5 more
Submitted
October 15, 2025
Key Contributions
MMLongCite introduces a comprehensive benchmark specifically designed to evaluate the fidelity of large vision-language models (LVLMs) in long-context scenarios, addressing a gap left by current evaluations, which are either text-only or limited to short multimodal contexts. The benchmark spans diverse tasks, context lengths, and modalities (text, images, and videos), enabling a deeper analysis of how context length and the position of crucial content affect model faithfulness.
Business Value
Provides essential tools for developers and researchers to rigorously assess and improve the reliability of long-context multimodal AI systems, which is crucial for applications that must understand extensive visual and textual information.