
MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

Abstract

The rapid advancement of large vision-language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee effective utilization of that context, posing a critical challenge for real-world applications. Current evaluations of long-context faithfulness focus predominantly on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
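The benchmark's layout (8 tasks, 6 context-length intervals, and text/image/video modalities, each scored for faithfulness) suggests a natural shape for evaluation records. Below is a minimal, purely illustrative Python sketch of such a record and a per-interval faithfulness report; the field names, the F1-style citation score, and the bucket boundaries are assumptions for illustration, not MMLongCite's actual schema or metric.

```python
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean

# Hypothetical record for one benchmark example; field names are illustrative,
# not MMLongCite's actual schema.
@dataclass
class Example:
    task: str                 # one of the 8 tasks
    modality: str             # "text", "image", or "video"
    context_tokens: int       # total multimodal context length
    gold_evidence: set[int]   # indices of chunks that support the answer
    cited_evidence: set[int]  # chunks the model actually cited

def faithfulness(ex: Example) -> float:
    """Toy citation-fidelity score: F1 between cited and gold evidence."""
    if not ex.cited_evidence and not ex.gold_evidence:
        return 1.0
    tp = len(ex.cited_evidence & ex.gold_evidence)
    prec = tp / len(ex.cited_evidence) if ex.cited_evidence else 0.0
    rec = tp / len(ex.gold_evidence) if ex.gold_evidence else 0.0
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

# Assumed context-length buckets; the paper's 6 intervals are not specified here.
BUCKETS = [8_000, 16_000, 32_000, 64_000, 128_000]

def bucket(tokens: int) -> str:
    for b in BUCKETS:
        if tokens <= b:
            return f"<={b}"
    return f">{BUCKETS[-1]}"

def report(examples: list[Example]) -> dict[str, float]:
    """Mean faithfulness per context-length interval."""
    groups: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        groups[bucket(ex.context_tokens)].append(faithfulness(ex))
    return {k: mean(v) for k, v in sorted(groups.items())}
```

Grouping scores by length interval makes it easy to see whether fidelity degrades as the context grows, which is one of the two axes the paper analyzes.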
Authors (11)
Keyan Zhou
Zecheng Tang
Lingfeng Ming
Guanghao Zhou
Qiguang Chen
Dan Qiao
+5 more
Submitted: October 15, 2025
arXiv Category: cs.CV
arXiv PDF

Key Contributions

MMLongCite introduces a comprehensive benchmark specifically designed to evaluate the fidelity of large vision-language models (LVLMs) in long-context scenarios, addressing a gap in current evaluations, which are either text-only or limited to short multimodal contexts. The benchmark spans diverse tasks, context lengths, and modalities (text, images, and videos), enabling a deeper analysis of how context length and the position of crucial content affect model faithfulness.
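As a companion to the per-length report above, the position analysis can be sketched the same way: bin examples by where the crucial content falls in the context and average a per-example score within each bin. Again, `position_frac` and the scores below are hypothetical inputs for illustration, not the paper's data.

```python
from collections import defaultdict
from statistics import mean

def by_position(records: list[tuple[float, float]], n_bins: int = 5) -> dict[str, float]:
    """records: (position_frac in [0, 1], faithfulness score) pairs.

    Bins examples by the relative position of the crucial content and
    returns the mean score per bin, e.g. to check for a mid-context dip.
    """
    bins: dict[int, list[float]] = defaultdict(list)
    for pos, score in records:
        bins[min(int(pos * n_bins), n_bins - 1)].append(score)
    return {f"{b / n_bins:.0%}-{(b + 1) / n_bins:.0%}": mean(s)
            for b, s in sorted(bins.items())}

# Toy usage with made-up (position, score) pairs:
print(by_position([(0.05, 0.9), (0.5, 0.6), (0.55, 0.55), (0.95, 0.8)]))
```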

Business Value

Provides essential tooling for developers and researchers to rigorously assess and improve the reliability of long-context multimodal AI systems, which is crucial for applications that must understand extensive visual and textual information.