📄 Abstract
Large Multimodal Models (LMMs) are increasingly applied to scientific
research, yet it remains unclear whether they can reliably understand and
reason over the multimodal complexity of papers. A central challenge lies in
detecting and resolving inconsistencies across text, figures, tables, and
equations: issues that are often subtle and domain-specific, and that
ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook
this issue, either isolating single modalities or relying on synthetic errors
that fail to capture real-world complexity. We introduce PRISMM-Bench
(Peer-Review-sourced Inconsistency Set for Multimodal Models), the first
benchmark grounded in real reviewer-flagged inconsistencies in scientific
papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering,
and human verification, we curate 262 inconsistencies from 242 papers. Based on
this set, we design three tasks, namely inconsistency identification,
inconsistency remedy, and pair matching, which assess a model's capacity to detect, correct, and
reason over inconsistencies across different modalities. Furthermore, to
address the notorious problem of choice-only shortcuts in multiple-choice
evaluation, where models exploit answer patterns without truly understanding
the question, we introduce structured JSON-based answer representations
that minimize linguistic biases by reducing reliance on superficial stylistic
cues. We benchmark 21 leading LMMs, including large open-weight models
(GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5
with high reasoning). Results reveal strikingly low performance (26.1-54.2%),
underscoring the challenge of multimodal scientific reasoning and motivating
progress towards trustworthy scientific assistants.
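To illustrate the structured-answer idea, here is a minimal sketch of how a free-text answer option might be recast as a schema-normalized JSON object. All field names and values below are hypothetical assumptions for illustration, not PRISMM-Bench's actual answer schema.

```python
import json

# Minimal sketch (hypothetical schema): the same answer option expressed as
# free text versus as a structured JSON object. Normalizing every option to
# one schema strips stylistic cues (length, phrasing, fluency) that models
# could otherwise exploit as choice-only shortcuts.
free_text_option = (
    "The accuracy in Figure 2 is reported as 84.2%, "
    "while Table 3 lists it as 48.2%."
)

structured_option = {
    "inconsistency_type": "figure_table_mismatch",  # hypothetical label
    "element_a": {"location": "Figure 2", "claim": "accuracy = 84.2%"},
    "element_b": {"location": "Table 3", "claim": "accuracy = 48.2%"},
}

# Every candidate answer serializes to the same shape, so choosing among
# options requires comparing content rather than surface style.
print(json.dumps(structured_option, indent=2))
```

Because each candidate option is rendered through the same fields, superficial differences in wording no longer correlate with correctness, which is the bias the JSON representation is meant to reduce.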
Authors (7)
Lukas Selch
Yufang Hou
M. Jehanzeb Mirza
Sivan Doveh
James Glass
Rogerio Feris
+1 more
Submitted
October 18, 2025
Key Contributions
PRISMM-Bench is the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers, addressing the limitations of existing benchmarks that overlook subtle, domain-specific cross-modal issues. It provides a curated set of 262 inconsistencies and three tasks to evaluate LMMs' ability to identify, resolve, and reason about these complex errors.
Business Value
Enhances the reliability and trustworthiness of AI tools used in scientific research and publishing, leading to more accurate analysis, better reproducibility, and faster discovery.