M3CoTBench: Medical CoT Benchmark

Updated 20 January 2026
  • M3CoTBench is a benchmark that systematically assesses chain-of-thought reasoning in multimodal language models for medical image analysis.
  • It features a curated dataset of 1,079 QA instances across 24 imaging modalities, refined through automated generation and expert validation.
  • The evaluation protocol measures correctness, efficiency, and consistency, exposing both improvements and challenges in clinical reasoning.

M3CoTBench is a benchmark designed to systematically assess chain-of-thought (CoT) reasoning in multimodal LLMs (MLLMs) for medical image understanding. By eliciting step-by-step intermediate reasoning rather than direct answer prediction, M3CoTBench aligns its evaluation protocol with the sequential and multifaceted nature of clinical decision-making, providing multi-dimensional insight into model transparency, reliability, and clinical interpretability (Jiang et al., 13 Jan 2026).

1. Dataset Structure and Curation

M3CoTBench comprises 1,079 QA instances spanning 24 medical imaging modalities and examination types, including X-ray, CT, MRI, OCT, endoscopy, histology, cytology, ultrasound, dermoscopy, and intraoral exams. Each instance is structured as a single image–question–answer triplet, sampled to ensure diversity and implicit stratification by difficulty.

Questions employ four formats: single-choice, multiple-choice, true/false, and short-answer. Task difficulty ranges from basic perceptual challenges to high-level inference and clinical decision support.
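
To make this concrete, the following minimal sketch shows how one such instance might be represented in code; the field names and the literal format values are illustrative assumptions rather than the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical representation of a single M3CoTBench item; the field names
# are assumptions for illustration, not the benchmark's actual data format.
QuestionFormat = Literal["single_choice", "multiple_choice", "true_false", "short_answer"]

@dataclass
class M3CoTInstance:
    image_path: str                  # one image per instance
    question: str                    # question text (options embedded for choice formats)
    answer: str                      # reference answer
    question_format: QuestionFormat
    modality: str                    # e.g. "X-ray", "CT", "MRI", "dermoscopy"
    task: str                        # one of the 13 task types (see Section 2)
```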

Images and questions are sourced from 55 public datasets, selected for diversity, typicality, class balance, and legal compliance. The curation pipeline involves automatic QA generation and rewriting using GPT-4o, followed by multi-phase calibration: independent annotation by three MLLMs, expert clinician adjudication, and a final human review sweep for consistency and validity.
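
A minimal sketch of such a generate-then-calibrate loop is given below, assuming a simple rule that any divergence among the three annotator models triggers clinician adjudication; the callables are placeholders, not components of the benchmark's actual pipeline.

```python
# Illustrative outline of the curation flow described above; generate_qa,
# annotate, expert_adjudicate, and final_review are hypothetical callables,
# and the annotations are assumed to be comparable strings.
def curate(raw_items, annotator_models, generate_qa, annotate, expert_adjudicate, final_review):
    curated = []
    for item in raw_items:
        qa = generate_qa(item)                                        # automatic QA generation/rewriting
        labels = [annotate(model, qa) for model in annotator_models]  # three independent MLLM passes
        if len(set(labels)) > 1:                                      # any disagreement -> clinician adjudication
            qa = expert_adjudicate(qa, labels)
        curated.append(qa)
    return final_review(curated)                                      # final human sweep for consistency/validity
```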

2. Task Suite and Difficulty Stratification

M3CoTBench defines 13 tasks that form a graduated scale from basic perceptual operations (Tier 1) to multi-step clinical reasoning (Tier 4). Tasks are delineated by clinical objective, input modality, expected output, and relative difficulty:

| Task | Output Type | Difficulty |
|---|---|---|
| Modality Recognition | Name of modality | Low |
| Image Quality | Good/Bad/Compare | Low |
| Recognition | Label | Low |
| Referring Recognition | Label | Low–Med |
| Counting | Integer | Low–Med |
| Localization | Location Text | Med |
| Diagnosis | Disease Name | Med |
| Grading | Grade/Category | Med |
| Symptom Identification | Symptom Text | Med |
| Action Planning | Action Option | High |
| Prediction | Prognosis Text | High |
| Functional Understanding | Functional Text | High |
| Causal Reasoning | Cause List | High |

Tasks range from image modality recognition and lesion counting to causal reasoning and treatment planning, thereby exposing MLLMs to the full spectrum of medical image-driven decision complexity.

3. Reasoning Step Annotation and Calibration Protocols

Each instance is annotated to reflect four clinical reasoning steps: (1) imaging modality identification, (2) key visual feature description, (3) diagnostic or recognition conclusion, and (4) advanced clinical analysis (e.g., etiology, treatment, prediction).
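
For reference, the canonical step ordering can be written down as a simple constant; the identifier names below are illustrative, not taken from the benchmark's release.

```python
# Canonical four-step reasoning path used for annotation; names are illustrative.
CANONICAL_STEPS = [
    "modality_identification",     # (1) identify the imaging modality
    "key_visual_features",         # (2) describe the salient visual findings
    "diagnostic_conclusion",       # (3) state the diagnosis or recognition result
    "advanced_clinical_analysis",  # (4) etiology, treatment, or prognostic reasoning
]
```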

Initial step annotation is performed by GPT-4o and Gemini-2.5 Pro. Subsequent reviews involve multi-stage student assessment, automated model consistency checks, targeted expert evaluation of flagged inconsistencies, consensus meetings for edge cases, and thorough final expert verification. No inter-annotator agreement statistics are reported; the annotation pipeline operates sequentially with expert oversight to maximize validity and consistency.

4. Evaluation Metrics

M3CoTBench defines four metrics tailored to CoT reasoning:

  • Correctness: Step-level precision, recall, and F₁-score, computed as the mean overlap between generated reasoning steps $R^{(i)}$ and reference expert paths $A_*^{(i)}$ (a code sketch follows this list):

$$
\mathrm{P} = \frac{1}{N} \sum_{i=1}^{N} \frac{|R^{(i)} \cap A_*^{(i)}|}{|R^{(i)}|}, \qquad
\mathrm{R} = \frac{1}{N} \sum_{i=1}^{N} \frac{|R^{(i)} \cap A_*^{(i)}|}{|A_*^{(i)}|}.
$$

  • Efficiency: Correct steps per second ($E$) and latency overhead per example ($L$), measuring the trade-off between reasoning transparency and inference speed.
  • Impact: Relative change in answer accuracy when CoT reasoning is applied:

$$
I = \mathrm{Acc}_{\mathrm{step}} - \mathrm{Acc}_{\mathrm{direct}},
$$

capturing whether CoT steps substantively improve correctness.

  • Consistency: Structural stability of reasoning paths per task, derived from the longest common subsequence similarity between generated step sequences and canonical paths.
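
The correctness, impact, and consistency computations can be sketched as below. Exact string matching between steps and max-length normalization of the LCS similarity are simplifying assumptions; the benchmark itself relies on LLM adjudication to decide whether a generated step matches a reference step.

```python
from typing import Sequence, Tuple

def step_prf(pred: Sequence[str], ref: Sequence[str]) -> Tuple[float, float, float]:
    """Step-level precision, recall, and F1 for one instance via set overlap.
    Exact string equality stands in for the benchmark's LLM-based step matching."""
    overlap = len(set(pred) & set(ref))
    p = overlap / len(pred) if pred else 0.0
    r = overlap / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def path_consistency(pred: Sequence[str], canonical: Sequence[str]) -> float:
    """LCS similarity between a generated step sequence and the canonical path
    (normalizing by the longer sequence is an assumption)."""
    if not pred or not canonical:
        return 0.0
    return lcs_length(pred, canonical) / max(len(pred), len(canonical))

def impact(acc_step: float, acc_direct: float) -> float:
    """Impact I: change in answer accuracy when CoT prompting replaces direct answering."""
    return acc_step - acc_direct
```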

Evaluation uses LLM adjudication by GPT-4o and Gemini 2.5 Pro with custom prompts; answer accuracy is scored by both GPT-4o and Llama-3.3-70B. Consistency is computed over all instances within each task and averaged across the 13 tasks.

5. Model Benchmarking Protocol

The protocol benchmarks a diverse set of models, including open-source, closed-source, and medical-specialized MLLMs:

  • Open-source: LLaVA-CoT, InternVL3.5 (8B/30B), Qwen3-VL-Instruct (8B/30B), Qwen3-VL-Thinking (8B/30B)
  • Closed-source: GPT-4.1, GPT-5, Gemini 2.5 Pro, Claude-Sonnet-4.5
  • Medical-specialized: LLaVA-Med (7B), HuatuoGPT-Vision (7B), HealthGPT (3.8B), Lingshu (7B/32B), MedGemma (4B/27B)

Inference uses a batch size of 1 and a temperature of 0.1; open models are run locally on AMD GPUs, while closed-source models are accessed via their respective APIs. The direct prompt requests only the final answer, whereas the CoT prompt requires sequential reasoning steps culminating in the final decision.
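
The two prompting modes can be illustrated roughly as follows; the template wording and the max_new_tokens value are assumptions, since the paper's exact prompts are not reproduced here.

```python
# Illustrative prompt templates for the two evaluation modes; the wording is
# an assumption, not the benchmark's released prompts.
DIRECT_PROMPT = (
    "You are given a medical image and a question.\n"
    "Question: {question}\n"
    "Respond with the final answer only."
)

COT_PROMPT = (
    "You are given a medical image and a question.\n"
    "Question: {question}\n"
    "Reason step by step: (1) identify the imaging modality, "
    "(2) describe the key visual features, (3) state your diagnostic conclusion, "
    "(4) add any further clinical analysis, then finish with 'Final answer: ...'."
)

# Decoding settings reported in the protocol; max_new_tokens is assumed.
GENERATION_KWARGS = {"temperature": 0.1, "max_new_tokens": 1024}
BATCH_SIZE = 1
```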

6. Results and Error Analysis

Quantitative results reveal that no single model achieves dominance across all evaluation metrics. Closed-source models generally exhibit higher reasoning consistency. Larger “Thinking” model variants typically outperform “Instruct” variants in F₁ and step alignment. Efficiency scores are adversely affected by increased inference overhead among large closed-source models.

Aggregate performance for selected models (abbreviated):

| Model | F₁ | Acc_direct | Acc_step | I | E | L | C_path |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-Thinking (30B) | 62.15 | 51.90 | 55.47 | +3.57 | 0.02 | 1.15 | 76.02 |
| Gemini 2.5 Pro | 66.07 | 60.24 | 60.06 | –0.18 | 0.10 | 1.52 | 82.00 |
| GPT-4.1 | 60.76 | 56.77 | 57.97 | +1.22 | 0.17 | 5.08 | 81.31 |

Key observations:

  • CoT reasoning sometimes decreases accuracy (negative I) in perceptual tasks, but can offer moderate gains (positive I) for complex reasoning challenges.
  • Error modes in CoT vary with task difficulty: perceptual tasks often accrue unnecessary reasoning overhead, while high-level reasoning tasks show occasional but limited benefit from the CoT structure.

7. Limitations and Prospects

The annotation pipeline relies on public dataset labels, which may contain errors or excessive specificity. Human expert validation is performed sequentially, without inter-annotator agreement statistics. Model output evaluation is performed primarily by LLMs, opening the possibility of scoring bias.

Experimental design constraints include no reported confidence intervals, no statistical significance analysis, and limited ablation of prompt strategies. Only single-run evaluations are presented.

Future work directions include expanded human validation for edge cases, finer-grained difficulty stratification with explicit tier labeling, multi-anchor calibration for inter-annotator agreement scoring, systematic prompt ablations, and integration of temporal or multi-view imaging scenarios for real-world clinical deployment.

Summary

M3CoTBench constitutes an authoritative, multi-dimensional framework for assessing chain-of-thought reasoning in medical MLLMs. It combines a diverse, expertise-calibrated dataset, rigorous annotation protocols, and targeted metrics for correctness, efficiency, impact, and consistency. The benchmark demonstrates that while CoT can promote interpretability, it does not consistently improve predictive accuracy, and may introduce distinct error signatures that necessitate robust, clinically validated reasoning architectures and further methodological innovation (Jiang et al., 13 Jan 2026).
