PRISMM-Bench: Evaluating Multimodal Inconsistencies

Updated 25 October 2025
  • The paper introduces PRISMM-Bench, a novel benchmark that uses real reviewer comments to evaluate LMMs’ abilities in detecting and correcting multimodal inconsistencies in scientific papers.
  • It employs a rigorous, multi-stage pipeline combining LLM-assisted filtering and human verification to curate objective, reproducible inconsistency datasets.
  • Empirical results reveal that while advanced LMMs benefit significantly from chain-of-thought reasoning, they still struggle with long-range context and deep multimodal integration.

PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models) is a benchmark for evaluating large multimodal models (LMMs) on the authentic challenge of detecting and reasoning over real-world inconsistencies in scientific papers. Unlike prior benchmarks that focus on synthetic noise, isolated modalities, or constructed conflicts, PRISMM-Bench directly utilizes reviewer-flagged errors from actual scientific peer review feedback, enabling an assessment of models’ reliability in the context of complex, domain-specific research artifacts. PRISMM-Bench provides a curated set of multimodal inconsistencies from published papers, a triad of tasks assessing detection and correction capabilities, and a structured evaluation protocol that emphasizes robust, context-sensitive reasoning over superficial answer heuristics.

1. Motivation and Conceptual Distinction

Scientific papers frequently contain subtle inconsistencies across their modalities—text, figures, tables, and equations—that can degrade clarity, reproducibility, and trust. Existing benchmarks typically address single-modality reasoning or employ synthetically injected errors, omitting the nuanced and context-dependent inconsistencies prevalent in real research output. PRISMM-Bench is distinct in that it systematically mines reviewer comments to identify genuine multimodal contradictions, such as misaligned reward functions between a figure and manuscript text. The benchmark thus aims to expose whether LMMs can move beyond shallow pattern recognition to robust multimodal scientific reasoning. This approach is motivated by a gap observed in prior work—for instance, MMIR (Yan et al., 22 Feb 2025) introduces synthetic reasoning errors for webpages and posters, but PRISMM-Bench grounds its inconsistencies in the authentic peer-review process.

2. Data Collection and Inconsistency Curation

PRISMM-Bench constructs its dataset using a multi-stage review mining pipeline:

  • Review Sourcing: The dataset begins with the aggregation of 12,366 reviewer comments from ICLR 2025, prioritizing rejected or withdrawn papers (without rebuttal) to ensure inconsistencies are verifiable in public documents.
  • LLM-Assisted Filtering: An LLM (Mistral Nemo 2407) summarizes reviews and extracts candidate sentences likely to mention multimodal inconsistencies, using keyword detection for terms such as “mismatch” and “conflict” in proximity to references to figures, tables, or equations (a sketch of this filtering pass appears at the end of this section). This pass reduces the candidate pool to 5,258.
  • Human Verification and Annotation: Annotators use a custom interface to confirm objective inconsistencies, mark the corresponding segments in the PDF (text passages and image crops), and ensure both sides of the contradiction are explicit and discoverable without excessive domain knowledge.

The final dataset comprises 262 validated multimodal inconsistencies sourced from 242 scientific papers. The annotation protocol emphasizes clarity and reproducibility, excluding stylistic or merely suggestive comments. Figure 1 of the primary paper provides a schematic overview of this pipeline.
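
The keyword-proximity pre-filter referenced above can be sketched in a few lines. The following is a minimal illustration assuming reviews are available as plain-text sentences; the keyword lists and proximity window are chosen for illustration and are not taken from the benchmark's actual configuration:

import re

# Illustrative keyword lists; the benchmark's actual vocabulary may differ.
INCONSISTENCY_TERMS = ["mismatch", "conflict", "inconsistent", "contradict"]
ELEMENT_REFS = ["figure", "fig.", "table", "tab.", "equation", "eq."]

def mentions_multimodal_inconsistency(sentence: str, window: int = 80) -> bool:
    """Return True if an inconsistency term occurs near a reference to a
    figure, table, or equation within a character window of the sentence."""
    text = sentence.lower()
    for term in INCONSISTENCY_TERMS:
        for match in re.finditer(re.escape(term), text):
            context = text[max(0, match.start() - window):match.end() + window]
            if any(ref in context for ref in ELEMENT_REFS):
                return True
    return False

def filter_candidate_sentences(review_sentences: list[str]) -> list[str]:
    """Keep only sentences likely to flag a multimodal inconsistency.
    In PRISMM-Bench this keyword pass is combined with LLM summarization
    and followed by human verification; only the keyword pass is shown here."""
    return [s for s in review_sentences if mentions_multimodal_inconsistency(s)]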

3. Evaluation Tasks and Contextual Design

PRISMM-Bench organizes its evaluation around three core tasks:

  • Inconsistency Identification (Ident): Given a multimodal paper excerpt and a generic prompt (“What is the inconsistency in these parts of a scientific paper?”), models must identify the precise contradiction.
  • Inconsistency Remedy (Remedy): Models are required to propose actionable corrections by selecting from choices that specify modifications, replacements, or repositioning of elements.
  • Inconsistency Pair-Match (Match): For inconsistencies involving two visual elements, the task is to pair each element with its conflicting counterpart, often with textual cues excluded, thus emphasizing purely visual reasoning.

This task design forces models to generalize beyond explicit instructions—the generic question format in the Ident task prevents overfitting to question content, while Remedy and Match tasks probe the ability to not just detect, but correct or reconcile inconsistencies across modalities.
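
To make the setup concrete, the sketch below assembles one Identification query from a cropped text passage, image crops, and structured answer options. The chat-style message format and field names are assumptions made for illustration, not the benchmark's released harness:

import json

def build_ident_query(text_passage: str, image_crop_paths: list[str],
                      options: list[dict]) -> list[dict]:
    """Assemble a chat-style multimodal query for the Identification task.
    The question is deliberately generic so models cannot exploit
    question-specific cues (hypothetical message format)."""
    question = "What is the inconsistency in these parts of a scientific paper?"
    content = [{"type": "text", "text": question},
               {"type": "text", "text": text_passage}]
    content += [{"type": "image", "path": path} for path in image_crop_paths]
    content.append({"type": "text",
                    "text": "Answer options (reply with the letter only):\n"
                            + json.dumps(options, indent=2)})
    return [{"role": "user", "content": content}]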

4. Structured Answer Representation and Debiasing Protocol

A known problem in multiple-choice evaluation is the exploitation of superficial answer heuristics, such as positional or length biases. PRISMM-Bench introduces a structured JSON-based answer format to mitigate these effects. Inconsistency identification options conform to an “Evidence–Claim” schema:

{
  "letter": "A" | "B" | "C" | "D",
  "attribute": string,
  "claim": { "source": "expectation" | string, "statement": string },
  "evidence": { "source": string, "statement": string }
}

The Remedy task adopts a “Target–Action” schema, specifying the element to correct, the action type, and a concise edit statement. By enforcing linguistic uniformity and explicit semantic structure, this protocol reduces model reliance on pattern matching and encourages genuine multimodal content engagement. Empirical analysis showed that “without context” accuracy dropped significantly when models were evaluated on the structured format, confirming reduced shortcut exploitation.
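
For comparison with the Evidence–Claim schema above, a single Remedy option in the “Target–Action” style might look roughly as follows (shown as a Python dict; the exact field names and values are not given in this summary and are assumed purely for illustration):

# Hypothetical Remedy option following the "Target-Action" idea described above;
# field names and values are illustrative assumptions, not the benchmark's schema.
remedy_option = {
    "letter": "B",
    "target": "Figure 3 (reward curve)",   # element to be corrected
    "action": "modify",                    # e.g. modify / replace / reposition
    "statement": "Update the plotted reward to match the definition in Eq. (2).",
}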

A quantitative metric, the Visual Reliance Ratio $R$, is defined as

$$R = \frac{Acc_{\text{with context}} - Acc_{\text{without context}}}{1 - Acc_{\text{without context}}}$$

Higher $R$ indicates stronger reliance on multimodal context over linguistic cues.
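
Computed from the two accuracy figures, the metric is straightforward; a minimal helper with a hypothetical worked example (round numbers, not values reported by the benchmark):

def visual_reliance_ratio(acc_with_context: float, acc_without_context: float) -> float:
    """R = (Acc_with - Acc_without) / (1 - Acc_without).
    Values near 1 mean gains come almost entirely from the multimodal context;
    values near 0 mean answers are largely recoverable from the options alone."""
    return (acc_with_context - acc_without_context) / (1.0 - acc_without_context)

# Hypothetical example: 52% accuracy with context, 28% from the options alone.
print(round(visual_reliance_ratio(0.52, 0.28), 3))  # 0.333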

5. Empirical Evaluation and Benchmarking Results

PRISMM-Bench was used to assess 21 leading LMMs, spanning open-weight models (e.g., GLM-4.5V 106B, InternVL3 78B) and proprietary systems (e.g., Gemini 2.5 Pro, GPT-5 with high reasoning). Key findings include:

  • Performance Spectrum: Accuracy on benchmark tasks ranged from 26.1% (for the weakest open-weight models) to 54.2% (GPT-5 and Gemini 2.5 Pro), leaving even the strongest models well short of reliable scientific assistance.
  • Contextual Granularity: Models performed best under “Focused” context (cropped to the minimum relevant text or image region); accuracy degraded as context broadened to full-page or full-document scale, revealing limitations in long-range and context-distraction robustness.
  • Task Difficulty: Remedy (correction) and Pair-Match (pure visual reasoning) scores were consistently lower than Identification, indicating increased difficulty with actionable reasoning and non-textual inference.
  • Reasoning Ablations: Models with enabled chain-of-thought reasoning (e.g., InternVL3.5 with reasoning) outperformed their ablated counterparts (“CoT-off”), with ablations causing drops of 19–34 percentage points. This demonstrates the critical role of stepwise multimodal reasoning in detecting and correcting inconsistencies.
  • Bias Sensitivity: When evaluated without multimodal context (i.e., on answers alone), models achieved non-trivial accuracy, further affirming the necessity of structured debiasing in multiple-choice protocols.

6. Analytical Comparison and Relation to Prior Benchmarks

PRISMM-Bench advances multimodal evaluation by grounding all inconsistencies in authentic peer-review feedback. MMIR (Yan et al., 22 Feb 2025) targets synthetic layout-rich artifacts (slides, posters) with injected errors across five categories, finding that models are more reliable at pairwise versus single-element inconsistencies and benefit from multimodal interleaving (MM-CoT). R-Bench (Li et al., 7 Oct 2024) evaluates robustness under real-world image corruptions, establishing a framework of 33 corruption dimensions and showing that LMMs lag behind human vision in robustness. MMReview (Gao et al., 19 Aug 2025) evaluates automated LLM-based peer review generation, probing stepwise reasoning across modalities and domains, but does not focus specifically on multimodal scientific inconsistencies flagged by human experts. XModBench (Wang et al., 16 Oct 2025) systematically diagnoses cross-modal consistency in omni-LLMs, identifying modality disparities and directional imbalance but stopping short of the scientific domain-specific inconsistency reasoning required by PRISMM-Bench.

Collectively, PRISMM-Bench complements and extends these works by defining a benchmark space wherein the evaluation signal is determined by trusted domain experts reporting real multimodal errors, and where reasoning, correction, and cross-modal understanding are assessed with strict debiasing controls.

7. Implications, Limitations, and Future Directions

PRISMM-Bench shows that even advanced LMMs exhibit limited reliability on authentic scientific inconsistency detection and correction, with high susceptibility to linguistic answer heuristics and insufficient multimodal document reasoning. This suggests that architectures capable of sustained, context-sensitive multimodal inference—incorporating both focused local and long-range global reasoning—are required for trustworthy scientific assistants. Chain-of-thought or modular reasoning approaches may offer promising improvements, as evidenced by ablation analyses.

A plausible implication is that future development should emphasize:

  • Expansion of the inconsistency corpus to cover a wider range of fields and accepted papers, enabling broader generalization testing.
  • Enhanced modular reasoning architectures capable of cross-modality and long-context integration.
  • Refinement and standardization of structured answer formats and debiasing protocols to further reduce shortcut exploitation.
  • Systematic user studies comparing machine and human performance to precisely locate reasoning gaps.

PRISMM-Bench offers a rigorous diagnostic tool for the next generation of multimodal scientific reasoning systems, directly informing both methodological research and trustworthy deployment in research practice.
