
MM-THEBench: Multimodal Reasoning Benchmark

Updated 6 February 2026
  • MM-THEBench is a benchmark framework that systematically analyzes hallucinations in multimodal LLM reasoning by dissecting intermediate chain-of-thought steps.
  • It employs a multi-layered evaluation pipeline with cognitive-science taxonomy and human-validated annotations to ensure precise error localization.
  • The framework reveals a precision–length trade-off where longer CoT traces improve recall but increase hallucination, highlighting challenges in balancing detail and accuracy.

MM-THEBench is a comprehensive benchmark framework for analyzing hallucination phenomena in reasoning multimodal LLMs (MLLMs), with a focus on the correctness and interpretability of intermediate chains of thought (CoT) in complex multimodal tasks. Unlike legacy multimodal benchmarks that emphasize final-answer accuracy, MM-THEBench systematically probes and categorizes hallucinations at the step level within the reasoning trace. Leveraging a taxonomy rooted in cognitive science, diverse annotated data, and a multi-layered automated evaluation pipeline, it enables rigorous, reproducible assessment of both perceptual and reasoning fidelity in advanced MLLMs (Huang et al., 30 Jan 2026).

1. Motivation and Scope

Recent reasoning MLLMs, such as GPT-5, OpenAI-o3, and Gemini-2.5-pro, generate explicit, multi-step CoT sequences for multimodal problem solving. Although these models achieve high headline performance on visual and procedural reasoning, their CoT traces commonly display inconsistencies, perceptual errors, or logical breakdowns that are masked by correct final answers. Previous benchmarks, including VQAv2, MMBench, and ZeroBench, are limited by their focus on final outputs and by their lack of granularity in hallucination or reasoning-error analysis. MM-THEBench addresses these gaps by dissecting the internal thinking process of MLLMs, providing a rubric-based, cognitive-dimension-aligned view of hallucination loci and types (Huang et al., 30 Jan 2026).

2. Hallucination Taxonomy Rooted in Cognitive Dimensions

MM-THEBench's taxonomy is organized in two layers: top-level cognitive domains and granular subcategories.

Top-Level Domains:

  • Knowledge (K): Factual acquisition or use outside the current percept/control context (e.g., background world knowledge).
  • Perception (P): Extraction and discrimination of features from presented visual or audio input.
  • Reasoning (R): Logical, arithmetic, or inferential steps manipulating existing representations.

Subcategories:

  • Knowledge: K₁ Commonsense, K₂ World Knowledge, K₃ Domain Knowledge
  • Perception: P₁ Recognition, P₂ OCR, P₃ Spatial, P₄ Counting, P₅ Audio, P₆ Grounding, P₇ Temporal
  • Reasoning: R₁ Deductive, R₂ Inductive/Generalization, R₃ Spatial Reasoning, R₄ Arithmetic, R₅ Causal, R₆ Decision Making, R₇ Instructional

Each atomic CoT step and hallucination instance is annotated according to this fine-grained typology, facilitating precise error localization and downstream mitigation.
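A compact way to picture this two-layer taxonomy is a mapping from domains to subcategory codes. The sketch below is purely illustrative: the names TAXONOMY and validate_tag are assumptions for this article, not identifiers from the released benchmark.

```python
# Illustrative encoding of the MM-THEBench two-layer taxonomy.
# TAXONOMY and validate_tag are hypothetical names, not part of the benchmark release.
TAXONOMY = {
    "K": {  # Knowledge
        "K1": "Commonsense", "K2": "World Knowledge", "K3": "Domain Knowledge",
    },
    "P": {  # Perception
        "P1": "Recognition", "P2": "OCR", "P3": "Spatial", "P4": "Counting",
        "P5": "Audio", "P6": "Grounding", "P7": "Temporal",
    },
    "R": {  # Reasoning
        "R1": "Deductive", "R2": "Inductive/Generalization", "R3": "Spatial Reasoning",
        "R4": "Arithmetic", "R5": "Causal", "R6": "Decision Making", "R7": "Instructional",
    },
}

def validate_tag(tag: str) -> bool:
    """Check that a step or hallucination label (e.g. 'P3') belongs to the taxonomy."""
    return any(tag in subcategories for subcategories in TAXONOMY.values())
```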

3. Dataset Construction and Annotation Pipeline

MM-THEBench samples 1340 multimodal QA items from eight established data sources: MathVision (21.5%), MM-vet-v2 (11.2%), MMMU-pro (13.8%), HallusionBench (14.9%), Omni-Spatial (14.9%), CharXiv (7.5%), GUI-Agent (7.4%), and Video-MME (8.9%). Only items demanding nontrivial reasoning are included. Step-wise chain-of-thought annotations are bootstrapped from Gemini-2.5-pro reasoning traces and then validated, split, or merged by human annotators to ensure atomicity, minimality, and alignment with the cognitive taxonomy. Each annotated CoT step links to a set of rubric items specifying satisfaction, dimension, and stepwise difficulty/capability ratings. Quality control imposes a two-phase audit that requires a global unqualified-label rate below 30% (Huang et al., 30 Jan 2026).
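The per-step annotation record described above can be pictured as a small schema linking each atomic CoT step to its rubric items. The dataclasses below are a minimal sketch under assumed field names (RubricItem, CoTStep, BenchmarkItem, difficulty), not the benchmark's actual data format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema sketch for an annotated MM-THEBench item;
# field names are assumptions based on the description above.
@dataclass
class RubricItem:
    description: str          # what a faithful step must establish
    dimension: str            # taxonomy code, e.g. "P3" or "R4"
    satisfied: bool = False   # filled in at evaluation time
    hallucinated: bool = False

@dataclass
class CoTStep:
    text: str                 # atomic, minimal reasoning step
    dimension: str            # cognitive taxonomy label
    difficulty: int           # stepwise difficulty/capability rating
    rubric_items: List[RubricItem] = field(default_factory=list)

@dataclass
class BenchmarkItem:
    source: str               # e.g. "MathVision", "Video-MME"
    question: str
    answer: str
    steps: List[CoTStep] = field(default_factory=list)
```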

4. Automated Multi-Level Evaluation Protocol

MM-THEBench employs a multi-stage LLM-based evaluation pipeline with Qwen-3-32B acting in four judge roles (a schematic sketch follows the list):

  1. Answer Extraction/Verification: Parses final answer, checks exactness or IoU for localization.
  2. Step Segmentation: Splits model-generated text into atomic reasoning steps.
  3. Step Matching: Aligns generated steps to rubric-labeled ground-truth, computes precision, recall, and F1 scores.
  4. Rubric Scoring: Assesses satisfaction and hallucination flags per rubric item, computing normalized subdimension scores and aggregate hallucination (“H-score”).
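A minimal sketch of how such a judge pipeline could be orchestrated is given below. The call_judge helper, the prompts, and all function names are assumptions for illustration only, not the actual MM-THEBench implementation (which uses Qwen-3-32B as the judge model).

```python
# Hypothetical orchestration of the four judge roles described above.
from typing import List

def call_judge(prompt: str) -> str:
    """Placeholder wrapper around the judge LLM; wire this to a real model client."""
    raise NotImplementedError

def verify_answer(model_output: str, gold: str) -> str:
    # Role 1: extract the final answer and check exactness (or IoU for localization).
    return call_judge(
        f"Extract the final answer and compare it to '{gold}'. "
        f"Reply MATCH, PARTIAL, or MISMATCH.\n\n{model_output}"
    )

def segment_steps(model_output: str) -> List[str]:
    # Role 2: split the generated reasoning trace into atomic steps.
    raw = call_judge(f"Split the following reasoning into atomic steps, one per line:\n{model_output}")
    return [s.strip() for s in raw.splitlines() if s.strip()]

def match_steps(generated: List[str], reference: List[str]) -> str:
    # Role 3: align generated steps to rubric-labeled ground-truth steps;
    # precision, recall, and F1 are then computed over this alignment.
    return call_judge(f"Align the generated steps to the reference steps.\n"
                      f"GENERATED: {generated}\nREFERENCE: {reference}")

def score_rubric(generated: List[str], rubric_items: List[str]) -> str:
    # Role 4: judge satisfaction and hallucination flags per rubric item,
    # from which sub-dimension scores and the aggregate H-score are derived.
    return call_judge(f"For each rubric item, answer satisfied yes/no and hallucinated yes/no.\n"
                      f"STEPS: {generated}\nRUBRIC: {rubric_items}")
```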

Key evaluation metrics include the following (a computation sketch follows the list):

  • Final answer accuracy:

\mathrm{ACC} = \frac{N_{\mathrm{MATCH}} + 0.5\,N_{\mathrm{PARTIAL}}}{N_{\mathrm{total}}}

  • Step-level precision, recall, and F1:

\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{F1} = \frac{2\,\mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

  • Rubric-level normalized scores S_K, S_P, S_R \in [0, 100]
  • Hallucination-free score:

H = 1 - \frac{\text{rubric items marked ``No'' with hallucination}}{\text{total rubric items}}

  • Relative thinking length and optional token-level hallucination rate.
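Under these definitions, the headline metrics can be computed directly from the judge outputs. The functions below are a minimal sketch assuming simple count inputs (n_match, n_partial, alignment TP/FP/FN counts, and per-rubric-item hallucination flags); they are not the benchmark's released scoring code.

```python
from typing import Dict, List

def accuracy(n_match: int, n_partial: int, n_total: int) -> float:
    # ACC = (N_MATCH + 0.5 * N_PARTIAL) / N_total
    return (n_match + 0.5 * n_partial) / n_total

def step_prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    # Step-level precision, recall, and F1 over the step alignment.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def hallucination_free_score(hallucinated_flags: List[bool]) -> float:
    # H = 1 - (# rubric items flagged as hallucinated) / (total rubric items)
    return 1.0 - sum(hallucinated_flags) / len(hallucinated_flags)

# Example with assumed counts:
print(accuracy(n_match=60, n_partial=10, n_total=100))        # 0.65
print(step_prf(tp=30, fp=50, fn=20))                          # precision 0.375, recall 0.6
print(hallucination_free_score([False, False, True, False]))  # 0.75
```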

5. Experimental Findings and Comparative Performance

Image-based tasks show final-answer accuracy of 50–70% across SOTA MLLMs; Qwen3-VL-235B and GPT-5 peak at 70.6% and 67.9%, respectively. However, step-level precision remains much lower (20–40%), indicating substantial hallucination in intermediate reasoning even when answers are correct. Rubric-level S_K, S_P, and S_R scores typically range from 65 to 90, but perception subdimensions consistently lag behind reasoning. All models achieve high overall H-scores (>90).

Video-based tasks yield lower accuracy (41–81.6%) and lower F1 step-alignment scores (16–54%), reflecting increased multimodal complexity. Reasoning hallucinations, especially in spatial (P₃, R₃) and arithmetic (R₄) steps, are highly predictive of incorrect answers, exhibiting strong odds ratios and significant statistical correlation (odds ratio > 2, p < 0.01; r < 0.25, p < 0.01).

A precision–length trade-off is observed: longer explicit CoTs yield higher recall but lower precision, indicating a tendency for overthinking to introduce extraneous or erroneous steps. Hallucination rates in step chains reach up to 15% even when the final answer is correct, and exceed 30% in incorrect cases.

6. Insights, Limitations, and Future Directions

Intermediate CoT traces are only weakly reliable: high headline accuracy hides copious hallucination, especially in spatial perception and reasoning. Perceptual (P) errors are common but often benign, whereas reasoning (R) and mixed errors are much more likely to undermine final correctness. The spatial domain—both visual-perceptual and logical—is a persistent failure point in all tested models.

Self-reflective, explicit thinking improves atomic step coverage but increases hallucination via verbosity and unnecessary inference (“overthinking”). Restricting CoT length may reduce reasoning errors. Interpretability via CoT granularity is contingent on robust alignment and hallucination tracking; as such, future directions include (a) constrained CoT generation, (b) targeted spatial error mitigation, and (c) development of multi-judge LLM ensembles to reduce judge-model bias in evaluation.

MM-THEBench stands as the most detailed and systematically annotated resource for stepwise evaluation of reasoning MLLMs, with a cognitive-science-grounded taxonomy, cross-model and cross-task coverage, and reproducible, automation-friendly evaluation framework (Huang et al., 30 Jan 2026).

7. Significance and Impact on Multimodal Reasoning Research

MM-THEBench provides a reference standard for hallucination diagnostics in the age of reasoning MLLMs. By exposing the divergence between answer-level performance and the fidelity of the underlying CoT, it motivates research on more trustworthy model “thinking” and facilitates rigorous ablation, error analysis, and targeted architectural refinement. Its fine-grained, cognitive-dimension-aligned rubric and step-level reasoning trace annotations create a template for future work in controllable reasoning length, spatial-grounded reasoning modules, and robust, scalable LLM-based evaluation. MM-THEBench's resource provides the necessary infrastructure for systematically closing the gap between opaque final answers and interpretable, hallucination-minimized reasoning in multimodal AI (Huang et al., 30 Jan 2026).
