PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin
2025-10-22
Summary
This paper investigates how well new Artificial Intelligence models, called Large Multimodal Models, can understand scientific papers that combine text, images, tables, and equations.
What's the problem?
These AI models are being used more and more in science, but it's unclear whether they can catch mistakes or inconsistencies when information is presented in different ways within a paper. For example, a graph might show something different from what the text says. Current tests don't really check for these kinds of real-world errors, often using simple or artificially created problems instead. This lack of thorough testing makes it hard to trust these models to help with scientific work.
What's the solution?
The researchers created a new test called PRISMM-Bench. They collected 262 real inconsistencies flagged by peer reviewers across 242 scientific papers. They then designed three tasks to test whether a model can identify these inconsistencies, fix them, and understand how different parts of a paper relate to each other. To prevent the models from simply guessing the right answer based on patterns in the questions, they also created a new way for models to give answers using a structured format, like a detailed JSON object, instead of plain multiple-choice options.
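To make the idea of a structured, JSON-based answer concrete, here is a minimal sketch of what such a representation could look like. The field names and values below are illustrative assumptions, not the benchmark's actual schema; the point is that a uniform structure leaves fewer stylistic cues for a model to exploit than free-form answer options.

```python
import json

# Hypothetical answer object in the spirit of PRISMM-Bench's structured
# format. Field names are illustrative assumptions, not the real schema.
answer = {
    "inconsistency_type": "figure_text_mismatch",   # e.g. a plot contradicting the prose
    "source_element": "Figure 3",                   # where the claim appears
    "conflicting_element": "Section 4.2",           # where the contradiction appears
    "explanation": "The text reports a 12% gain, but the chart shows a drop.",
}

# Serializing every answer into the same JSON shape reduces reliance on
# superficial linguistic style, one of the shortcuts the paper targets.
encoded = json.dumps(answer, indent=2)
print(encoded)
```

Because every candidate answer shares the same keys and formatting, differences between options come down to content rather than phrasing.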
Why does it matter?
The results showed that even the most advanced AI models struggle with this type of multimodal reasoning, only scoring between 26% and 54% on the tests. This highlights a significant weakness in current AI and emphasizes the need for better models that can reliably understand and work with complex scientific information, ultimately leading to more trustworthy AI assistants for scientists.
Abstract
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations: issues that are often subtle and domain-specific, and that ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.