M3CoTBench: Medical CoT Benchmark
- M3CoTBench is a benchmark that systematically assesses chain-of-thought reasoning in multimodal language models for medical image analysis.
- It features a curated dataset of 1,079 QA instances across 24 imaging modalities, refined through automated generation and expert validation.
- The evaluation protocol measures correctness, efficiency, impact, and consistency, exposing both improvements and challenges in clinical reasoning.
M3CoTBench is a benchmark designed to systematically assess chain-of-thought (CoT) reasoning in multimodal LLMs (MLLMs) for medical image understanding. By incentivizing step-by-step intermediate reasoning rather than direct answer prediction, M3CoTBench aligns evaluation protocols with the sequential and multifaceted nature of clinical decision processes, providing multi-dimensional insights into model transparency, reliability, and clinical interpretability (Jiang et al., 13 Jan 2026).
1. Dataset Structure and Curation
M3CoTBench comprises 1,079 QA instances spanning 24 medical imaging modalities and examination types, including X-ray, CT, MRI, OCT, endoscopy, histology, cytology, ultrasound, dermoscopy, and intraoral exams. Each instance is structured as a single image–question–answer triplet, sampled to ensure diversity and implicit stratification by difficulty.
Questions employ four formats: single-choice, multiple-choice, true/false, and short-answer. Task difficulty ranges from basic perceptual challenges to high-level inference and clinical decision support.
Images and questions are sourced from 55 public datasets, selected for diversity, typicality, class balance, and legal compliance. The curation pipeline involves automatic QA generation and rewriting using GPT-4o, followed by multi-phase calibration: independent annotation by three MLLMs, expert clinician adjudication, and a final human review sweep for consistency and validity.
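A minimal sketch of how one such image–question–answer instance could be represented in code. The field names are hypothetical illustrations of the structure described above, not the benchmark's released schema:

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record layout for one M3CoTBench instance; field names are
# illustrative only and may differ from the benchmark's published format.
@dataclass
class M3CoTInstance:
    image_path: str               # single medical image (e.g., X-ray, CT slice, OCT scan)
    modality: str                 # one of the 24 imaging modalities / examination types
    task: str                     # one of the 13 tasks (e.g., "Diagnosis", "Counting")
    question: str                 # question text
    question_format: str          # "single-choice" | "multiple-choice" | "true/false" | "short-answer"
    options: Optional[List[str]]  # present only for choice-style formats
    answer: str                   # expert-validated reference answer
    reference_steps: List[str]    # four-step expert reasoning path (see Section 3)
```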
2. Task Suite and Difficulty Stratification
M3CoTBench defines 13 tasks that form a graduated scale from basic perceptual operations (Tier 1) to multi-step clinical reasoning (Tier 4). Tasks are delineated by clinical objective, input modality, expected output, and relative difficulty:
| Task | Output Type | Difficulty |
|---|---|---|
| Modality Recognition | Name of modality | Low |
| Image Quality | Good/Bad/Compare | Low |
| Recognition | Label | Low |
| Referring Recognition | Label | Low–Med |
| Counting | Integer | Low–Med |
| Localization | Location Text | Med |
| Diagnosis | Disease Name | Med |
| Grading | Grade/Category | Med |
| Symptom Identification | Symptom Text | Med |
| Action Planning | Action Option | High |
| Prediction | Prognosis Text | High |
| Functional Understanding | Functional Text | High |
| Causal Reasoning | Cause List | High |
Tasks range from image modality recognition and lesion counting to causal reasoning and treatment planning, thereby exposing MLLMs to the full spectrum of medical image-driven decision complexity.
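For per-difficulty analysis, the task table can be encoded directly; the mapping below simply transcribes the Difficulty column as an illustrative convenience, not an artifact released with the benchmark:

```python
# Difficulty labels copied from the task table above; useful for aggregating
# metrics per difficulty band rather than per individual task.
TASK_DIFFICULTY = {
    "Modality Recognition": "Low",
    "Image Quality": "Low",
    "Recognition": "Low",
    "Referring Recognition": "Low-Med",
    "Counting": "Low-Med",
    "Localization": "Med",
    "Diagnosis": "Med",
    "Grading": "Med",
    "Symptom Identification": "Med",
    "Action Planning": "High",
    "Prediction": "High",
    "Functional Understanding": "High",
    "Causal Reasoning": "High",
}

def tasks_at(level: str) -> list[str]:
    """Return the tasks labeled with a given difficulty band."""
    return [task for task, band in TASK_DIFFICULTY.items() if band == level]
```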
3. Reasoning Step Annotation and Calibration Protocols
Each instance is annotated to reflect four clinical reasoning steps: (1) imaging modality identification, (2) key visual feature description, (3) diagnostic or recognition conclusion, and (4) advanced clinical analysis (e.g., etiology, treatment, prediction).
Initial step annotation is performed by GPT-4o and Gemini-2.5 Pro. Subsequent reviews involve multi-stage student assessment, automated model consistency checks, targeted expert evaluation of flagged inconsistencies, consensus meetings for edge cases, and thorough final expert verification. No inter-annotator agreement statistics are reported; the annotation pipeline operates sequentially with expert oversight to maximize validity and consistency.
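An illustrative four-step reference path showing the structure such an annotation takes; the clinical wording here is invented for demonstration and is not drawn from the dataset:

```python
# Illustrative four-step reference reasoning path; the content is invented
# to show the annotation structure, not taken from M3CoTBench itself.
reference_steps = [
    "Step 1 (modality): The image is a posteroanterior chest X-ray.",
    "Step 2 (key features): There is a focal opacity in the right lower lobe "
    "with air bronchograms and no pleural effusion.",
    "Step 3 (conclusion): Findings are most consistent with lobar pneumonia.",
    "Step 4 (advanced analysis): Likely bacterial etiology; empirical "
    "antibiotics and follow-up imaging would be a typical next step.",
]
```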
4. Evaluation Metrics
M3CoTBench defines four metrics tailored to CoT reasoning:
- Correctness: Step-level precision (P), recall (R), and F₁-score (F₁ = 2PR / (P + R)), computed as the mean overlap between a model's generated reasoning steps and the reference expert path.
- Efficiency: Correct steps per second (E) and latency overhead per example (L), measuring the trade-off between reasoning transparency and inference speed.
- Impact: Change in answer accuracy when CoT reasoning is applied relative to direct answering, I = Acc_step − Acc_direct, capturing whether CoT steps substantively improve correctness.
- Consistency: Structural stability of reasoning paths per task (C_path), derived from the longest-common-subsequence similarity between generated step sequences and canonical paths (a code sketch of these metrics follows this list).
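A minimal sketch of these computations, assuming step matching against the reference path has already been adjudicated (in the benchmark this is done by LLM judges); the helper names are illustrative:

```python
from difflib import SequenceMatcher

def step_f1(matched: int, n_generated: int, n_reference: int) -> tuple[float, float, float]:
    """Step-level precision, recall, and F1 given the number of generated
    steps judged to match a reference step."""
    precision = matched / n_generated if n_generated else 0.0
    recall = matched / n_reference if n_reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def impact(acc_step: float, acc_direct: float) -> float:
    """Impact I: change in answer accuracy when CoT prompting is used
    (positive means CoT helps, negative means it hurts).
    Example from the results table: impact(60.06, 60.24) is approximately -0.18."""
    return acc_step - acc_direct

def efficiency(correct_steps: int, seconds: float) -> float:
    """Efficiency E: correct reasoning steps produced per second."""
    return correct_steps / seconds if seconds else 0.0

def path_consistency(generated: list[str], canonical: list[str]) -> float:
    """Consistency C_path: similarity between a generated step sequence and
    the canonical four-step path. SequenceMatcher on step labels approximates
    a longest-common-subsequence ratio."""
    return SequenceMatcher(None, generated, canonical).ratio()
```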
Evaluation is performed using adjudications from GPT-4o, Gemini 2.5 Pro, and custom LLM prompts. Answer accuracy utilizes both GPT-4o and Llama-3.3-70B. Consistency is computed over all instances in each task and averaged across all 13 tasks.
5. Model Benchmarking Protocol
The protocol benchmarks a diverse set of models, including open-source, closed-source, and medical-specialized MLLMs:
- Open-source: LLaVA-CoT, InternVL3.5 (8B/30B), Qwen3-VL-Instruct (8B/30B), Qwen3-VL-Thinking (8B/30B)
- Closed-source: GPT-4.1, GPT-5, Gemini 2.5 Pro, Claude-Sonnet-4.5
- Medical-specialized: LLaVA-Med (7B), HuatuoGPT-Vision (7B), HealthGPT (3.8B), Lingshu (7B/32B), MedGemma (4B/27B)
Inference employs a batch size of 1 and a temperature of 0.1; open models are run locally on AMD GPUs, while closed-source models are accessed through their respective APIs. The direct prompt requests only the final answer, whereas the CoT prompt requires sequential reasoning steps culminating in the final decision.
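A sketch of the two prompting modes. The benchmark's exact prompt wording is not reproduced here, so the templates below are assumptions that only mirror the described behavior:

```python
# Hypothetical prompt templates contrasting the two evaluation modes; the
# benchmark's actual wording may differ.
DIRECT_PROMPT = (
    "You are given a medical image and a question. "
    "Answer with the final answer only, with no explanation.\n"
    "Question: {question}"
)

COT_PROMPT = (
    "You are given a medical image and a question. Reason step by step: "
    "(1) identify the imaging modality, (2) describe the key visual features, "
    "(3) state the diagnostic or recognition conclusion, (4) give any further "
    "clinical analysis, then end with 'Final answer:' followed by your decision.\n"
    "Question: {question}"
)

INFERENCE_CONFIG = {"batch_size": 1, "temperature": 0.1}  # as reported in the protocol
```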
6. Results and Error Analysis
Quantitative results reveal that no single model achieves dominance across all evaluation metrics. Closed-source models generally exhibit higher reasoning consistency. Larger “Thinking” model variants typically outperform “Instruct” variants in F₁ and step alignment. Efficiency scores are adversely affected by increased inference overhead among large closed-source models.
Aggregate performance for selected models (abbreviated). Columns follow the Section 4 metrics: F₁ = step-level F₁, Acc_direct/Acc_step = answer accuracy without/with CoT, I = impact, E = efficiency (correct steps per second), L = latency overhead, C_path = path consistency:
| Model | F₁ | Acc_direct | Acc_step | I | E | L | C_path |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-Thinking (30B) | 62.15 | 51.90 | 55.47 | +3.57 | 0.02 | 1.15 | 76.02 |
| Gemini 2.5 Pro | 66.07 | 60.24 | 60.06 | –0.18 | 0.10 | 1.52 | 82.00 |
| GPT-4.1 | 60.76 | 56.77 | 57.97 | +1.22 | 0.17 | 5.08 | 81.31 |
Key observations:
- CoT reasoning sometimes decreases accuracy (negative I) in perceptual tasks, but can offer moderate gains (positive I) for complex reasoning challenges.
- Error modes track task difficulty: perceptual tasks often accrue unnecessary reasoning overhead, while high-level reasoning tasks show occasional but limited benefit from CoT structure.
7. Limitations and Prospects
The annotation pipeline relies on public dataset labels, which may contain errors or excessive specificity. Human expert validation is performed sequentially, without inter-annotator agreement statistics. Model output evaluation is performed primarily by LLMs, opening the possibility of scoring bias.
Experimental design constraints include no reported confidence intervals, no statistical significance analysis, and limited ablation of prompt strategies. Only single-run evaluations are presented.
Future work directions include expanded human validation for edge cases, finer-grained difficulty stratification with explicit tier labeling, multi-anchor calibration for inter-annotator agreement scoring, systematic prompt ablations, and integration of temporal or multi-view imaging scenarios for real-world clinical deployment.
Summary
M3CoTBench constitutes an authoritative, multi-dimensional framework for assessing chain-of-thought reasoning in medical MLLMs. It combines a diverse, expertise-calibrated dataset, rigorous annotation protocols, and targeted metrics for correctness, efficiency, impact, and consistency. The benchmark demonstrates that while CoT can promote interpretability, it does not consistently improve predictive accuracy, and may introduce distinct error signatures that necessitate robust, clinically validated reasoning architectures and further methodological innovation (Jiang et al., 13 Jan 2026).