📄 Abstract
Large Multimodal Models (LMMs) are increasingly applied to scientific
research, yet it remains unclear whether they can reliably understand and
reason over the multimodal complexity of papers. A central challenge lies in
detecting and resolving inconsistencies across text, figures, tables, and
equations: issues that are often subtle and domain-specific, and that
ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook
this issue, either isolating single modalities or relying on synthetic errors
that fail to capture real-world complexity. We introduce PRISMM-Bench
(Peer-Review-sourced Inconsistency Set for Multimodal Models), the first
benchmark grounded in real reviewer-flagged inconsistencies in scientific
papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering,
and human verification, we curate 262 inconsistencies from 242 papers. Based on
this set, we design three tasks, namely inconsistency identification,
inconsistency remedy, and pair matching, which assess a model's capacity to detect, correct, and
reason over inconsistencies across different modalities. Furthermore, to
address the notorious problem of choice-only shortcuts in multiple-choice
evaluation, where models exploit answer patterns without truly understanding
the question, we introduce structured JSON-based answer representations
that minimize linguistic biases by reducing reliance on superficial stylistic
cues. We benchmark 21 leading LMMs, including large open-weight models
(GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5
with high reasoning). Results reveal strikingly low performance (26.1-54.2%),
underscoring the challenge of multimodal scientific reasoning and motivating
progress towards trustworthy scientific assistants.
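To illustrate the structured-answer idea, here is a minimal sketch of how a free-text answer option might be recast as a schema-normalized JSON object. All field names and values below are hypothetical assumptions for illustration, not PRISMM-Bench's actual answer schema.

```python
import json

# Minimal sketch (hypothetical schema): the same answer option expressed as
# free text versus as a structured JSON object. Normalizing every option to
# one schema strips stylistic cues (length, phrasing, fluency) that models
# could otherwise exploit as choice-only shortcuts.
free_text_option = (
    "The accuracy in Figure 2 is reported as 84.2%, "
    "while Table 3 lists it as 48.2%."
)

structured_option = {
    "inconsistency_type": "figure_table_mismatch",  # hypothetical label
    "element_a": {"location": "Figure 2", "claim": "accuracy = 84.2%"},
    "element_b": {"location": "Table 3", "claim": "accuracy = 48.2%"},
}

# Every candidate answer serializes to the same shape, so choosing among
# options requires comparing content rather than surface style.
print(json.dumps(structured_option, indent=2))
```

Because each candidate option is rendered through the same fields, superficial differences in wording no longer correlate with correctness, which is the bias the JSON representation is meant to reduce.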
Authors (7)
Lukas Selch
Yufang Hou
M. Jehanzeb Mirza
Sivan Doveh
James Glass
Rogerio Feris
+1 more
Submitted
October 18, 2025
Key Contributions
PRISMM-Bench is the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers, addressing the limitations of existing benchmarks that overlook subtle, domain-specific cross-modal issues. It provides a curated set of 262 inconsistencies and three tasks to evaluate LMMs' ability to identify, resolve, and reason about these complex errors.
Business Value
Enhances the reliability and trustworthiness of AI tools used in scientific research and publishing, leading to more accurate analysis, better reproducibility, and faster discovery.