Adversarial Evasion Attacks
- Adversarial evasion attacks modify input data to mislead machine learning systems while keeping the changes unnoticeable to human observers.
- They typically use gradient-based methods to create imperceptible perturbations that exploit vulnerabilities in model architectures.
- Implementing defenses like adversarial training and detection mechanisms can significantly improve model robustness against these attacks.
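As a minimal sketch of a gradient-based perturbation, the snippet below applies the Fast Gradient Sign Method (FGSM), one common attack of this kind, to a toy logistic-regression model. The weights, the input, and the budget `eps` are hypothetical values chosen for illustration only and do not come from any real system.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g: float) -> int:
    return (g > 0) - (g < 0)

def fgsm(w, b, x, y, eps):
    """One-step FGSM: shift every feature by eps in the direction that raises the loss.

    For binary cross-entropy with a logistic model, dLoss/dx_i = (p - y) * w_i.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (illustrative assumptions).
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.2], 1                 # clean input with true label 1
x_adv = fgsm(w, b, x, y, eps=0.25)   # each feature moves by at most 0.25

print(predict(w, b, x))      # above 0.5: classified as class 1
print(predict(w, b, x_adv))  # below 0.5: the small perturbation flips the decision
```

Adversarial training, mentioned above as a defense, would add examples such as `x_adv` with their correct labels back into the training set.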
Summary of “M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding”
- Dataset Design
- Size & Modalities
- 1,079 medical images with one QA pair each.
- Covers 24 examination types/modalities grouped into six major categories:
- Ophthalmic (e.g., slit-lamp photography, fundus photography, OCT, OCTA, SLO, FFA)
- Radiology (X-ray, CT, MRI, nuclear medicine)
- Endoscopy (laparoscopy, colonoscopy, gastroscopy, capsule endoscopy, bronchoscopy, ENT endoscopy, fetoscopy)
- Microscopy (histology, cytology, fluorescence microscopy)
- Ultrasound (B-scan, obstetric, musculoskeletal, carotid, liver)
- Surface inspections (dermoscopy, intraoral)
- Task Difficulty Levels
- QA pairs span 13 task types ranging from low-level perception (e.g., image-quality assessment) to high-level clinical reasoning (e.g., causal inference, action planning).
- Although all 1,079 cases are used in aggregate evaluation, tasks are stratified by difficulty in analysis (perceptual vs. knowledge-based).
- Annotation Protocol
- Formats: single‐choice, multiple‐choice, true/false, short answer.
- Inference-driven questions (e.g., “What might be the cause…?”) are included to introduce hierarchical difficulty.
- 2. AI‐human calibration: three MLLMs independently answer each pair; any disagreement triggers review by an expert physician; final sanity checks on all samples.
- 3. CoT annotation workflow:
- a. MLLM‐based first draft of reasoning following a four‐step clinical structure:
- 1) Identify modality/exam type
- 2) Describe key visual features
- 3) Draw diagnostic conclusion
- 4) Additional analysis (e.g., treatment, causality).
- b. Multi‐stage human‐AI review: student reviewer → automated multi‐model check → expert adjudication on flagged steps → consensus meetings → final expert read‐through.
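The AI-human calibration step can be sketched as a simple unanimity check. This is an illustrative assumption about how "disagreement" might be detected: the function name, the answer normalization, and treating a mismatch with the reference answer as disagreement are all hypothetical, not details given in the paper summary.

```python
def needs_expert_review(model_answers: list[str], reference: str) -> bool:
    """Flag a QA pair for physician review when the answers are not unanimous.

    Assumption: a disagreement is any difference among the model answers
    or between a model answer and the reference answer.
    """
    normalized = {a.strip().lower() for a in model_answers}
    normalized.add(reference.strip().lower())
    return len(normalized) > 1

# Three hypothetical model answers per QA pair.
print(needs_expert_review(["B", "b ", " B"], reference="B"))  # unanimous: no review
print(needs_expert_review(["A", "B", "B"], reference="B"))    # one dissent: review
```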
- Task Suite
M3CoTBench defines 13 tasks. For each task, we summarize the clinical goal, input, output, and relative difficulty.
| Task | Clinical Objective | Input Format | Output Format | Difficulty |
|---|---|---|---|---|
| 1. Examination Type | Recognize imaging modality | Image + prompt | Single‐choice | Low |
| 2. Image Quality | Judge diagnostic adequacy | Image | True/False or compare | Low |
| 3. Recognition | Identify structure, cell type, instrument | Image + prompt | Single/multi‐choice | Low–Mid |
| 4. Referring Recognition | Identify specified region | Image + region or mask | Single‐choice | Mid |
| 5. Localization | Localize lesions or anatomy | Image | Short answer (e.g., “upper lobe”) | Mid |
| 6. Counting | Count discrete objects (cells, tools, polyps) | Image | Short answer (integer) | Mid |
| 7. Diagnosis | Infer disease or abnormality | Image + prompt | Single/multi‐choice | Mid–High |
| 8. Grading | Assess severity/stage (e.g., DR level) | Image | Single‐choice | Mid–High |
| 9. Symptom | Identify clinical signs from image | Image + prompt | Short answer/string | High |
| 10. Clinical Action | Recommend next steps (e.g., treatment) | Image + multi‐choice | Single‐choice | High |
| 11. Prediction | Estimate progression or risk | Image + prompt | Short answer/string | High |
| 12. Function | Interpret organ/machine function | Image + prompt | Short answer/string | High |
| 13. Causal Reasoning | Identify etiology or cause | Image + multi‐choice | Multi‐choice | High |
- Chain-of-Thought Evaluation Metrics
M3CoTBench evaluates CoT on four dimensions aligned with clinical reasoning requirements:
3.1 Correctness
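- Measures alignment between the model‐generated steps $S_{\text{pred}}$ and the gold steps $S_{\text{gold}}$. Since multiple valid reasoning paths exist, the reference that maximizes overlap is chosen.
- Precision, recall, and F1 are computed per sample and averaged over the dataset:
$$P = \frac{|S_{\text{pred}} \cap S_{\text{gold}}|}{|S_{\text{pred}}|}, \qquad R = \frac{|S_{\text{pred}} \cap S_{\text{gold}}|}{|S_{\text{gold}}|}, \qquad F1 = \frac{2PR}{P + R}$$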
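A minimal sketch of the correctness computation follows. It assumes steps can be matched by exact string equality and that "maximizing overlap" means picking the gold path with the highest F1; in the benchmark itself, step matching is judged by LLMs, so both choices are simplifying assumptions, and the function names and example steps are hypothetical.

```python
def step_prf(pred_steps, gold_steps):
    """Precision, recall, and F1 of predicted steps against one gold path."""
    pred, gold = set(pred_steps), set(gold_steps)
    matched = len(pred & gold)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def best_reference_score(pred_steps, references):
    """Score against the gold path with the largest overlap (assumed: highest F1)."""
    return max((step_prf(pred_steps, ref) for ref in references),
               key=lambda score: score[2])

# Hypothetical example with two valid gold reasoning paths.
pred = ["modality: blood smear", "bilobed nucleus", "diagnosis: eosinophil"]
refs = [
    ["modality: blood smear", "bilobed nucleus",
     "eosinophilic granules", "diagnosis: eosinophil"],
    ["modality: blood smear", "diagnosis: lymphocyte"],
]
p, r, f1 = best_reference_score(pred, refs)
print(p, r, round(f1, 3))  # the first reference wins: P = 1.0, R = 0.75
```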
3.2 Efficiency
- Reflects how many correct reasoning steps a model produces per unit time, and the added latency due to CoT.
Larger $E$ (↑) indicates more correct steps per second. Larger $L$ (↓) indicates higher latency overhead.
3.3 Impact
- Quantifies whether CoT improves final‐answer accuracy.
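$$I = \mathrm{Acc}_{\text{step}} - \mathrm{Acc}_{\text{dir}}$$
where $\mathrm{Acc}_{\text{step}}$ is accuracy with CoT and $\mathrm{Acc}_{\text{dir}}$ is accuracy without CoT (direct answering).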
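As a worked example, the impact score can be reproduced from the accuracy columns of Table 1, assuming impact is the difference between accuracy with CoT (`Acc_step`) and accuracy with direct answering (`Acc_dir`). The helper name `impact` is hypothetical; the numbers are taken from Table 1.

```python
def impact(acc_step: float, acc_dir: float) -> float:
    """CoT impact in percentage points: accuracy with CoT minus direct accuracy."""
    return round(acc_step - acc_dir, 1)

# (Acc_dir, Acc_step) pairs copied from Table 1.
table_rows = {
    "LLaVA-CoT": (40.1, 36.8),
    "Qwen3-VL-Thinking-8B": (48.3, 52.8),
    "GPT-4.1": (56.8, 58.0),
}

for model, (acc_dir, acc_step) in table_rows.items():
    print(model, impact(acc_step, acc_dir))
```

A negative value means the model answered less accurately when asked to reason step by step.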
3.4 Consistency
- Measures how structurally stable reasoning paths are within the same task. Represent each path $p_i$ as an ordered sequence of step categories.
1) Select the canonical path for the task.
2) Compute task‐level consistency relative to the canonical path.
Higher $C_{\text{path}}$ indicates more uniform reasoning structure.
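One possible reading of the two consistency steps is sketched below. It assumes the canonical path is the most frequent sequence of step categories within a task and that consistency is the fraction of paths identical to it. Both definitions are assumptions made for illustration, and the benchmark's exact formulas may differ (for example, by using a similarity measure rather than exact match).

```python
from collections import Counter

def canonical_path(paths):
    """Assumed definition: the most frequent ordered sequence of step categories."""
    return Counter(tuple(p) for p in paths).most_common(1)[0][0]

def path_consistency(paths):
    """Assumed definition: share of paths that exactly match the canonical path."""
    canon = canonical_path(paths)
    return sum(tuple(p) == canon for p in paths) / len(paths)

# Hypothetical reasoning paths for one task, written as step categories.
paths = [
    ["modality", "features", "conclusion"],
    ["modality", "features", "conclusion"],
    ["features", "modality", "conclusion"],
    ["modality", "features", "conclusion", "additional"],
]
print(canonical_path(paths))    # ('modality', 'features', 'conclusion')
print(path_consistency(paths))  # 0.5
```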
- Benchmarking Protocol
4.1 Experimental Setup
- Preprocessing: Original image resolutions are retained. For pairwise tasks (e.g., comparison), the images are simply concatenated.
- Prompting:
- CoT prompt: “Please generate a step-by-step answer, including all intermediate reasoning steps, and provide the final answer at the end.”
- Direct prompt: “Please directly provide the final answer without any additional output.”
- Inference settings: batch size = 1, temperature = 0.1.
- Answers and reasoning steps are evaluated with GPT-4o, Llama-3.3-70B-Instruct-Turbo, and Gemini 2.5 Pro.
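The two prompting modes and the inference settings above could be wired together roughly as follows. The prompt strings, temperature, and batch size come from the setup described here; the function name `build_request`, the request dictionary layout, and appending the instruction after the question are hypothetical choices for illustration.

```python
COT_PROMPT = ("Please generate a step-by-step answer, including all intermediate "
              "reasoning steps, and provide the final answer at the end.")
DIRECT_PROMPT = "Please directly provide the final answer without any additional output."

def build_request(question: str, mode: str) -> dict:
    """Build one inference request for the 'cot' or 'direct' setting.

    Assumption: the instruction is appended after the question text.
    """
    if mode == "cot":
        instruction = COT_PROMPT
    elif mode == "direct":
        instruction = DIRECT_PROMPT
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "prompt": f"{question}\n{instruction}",
        "temperature": 0.1,  # inference setting from the experimental setup
        "batch_size": 1,     # inference setting from the experimental setup
    }

request = build_request("True or False: The cell shown is a lymphocyte.", "cot")
print(request["prompt"])
```

Running the same question through both modes yields the paired answers needed to compare accuracy with and without CoT.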
4.2 Evaluated Models
| Category | Model (size) |
|---|---|
| Open-source MLLMs | LLaVA-CoT; InternVL3.5 (8B, 30B); Qwen3-VL-Instruct (8B, 30B); Qwen3-VL-Thinking (8B, 30B) |
| Closed-source MLLMs | GPT-4.1; GPT-5; Gemini 2.5 Pro; Claude-Sonnet-4.5 |
| Medical-specific MLLMs | LLaVA-Med (7B); HuatuoGPT-Vision (7B); HealthGPT (3.8B); Lingshu (7B, 32B); MedGemma (4B, 27B) |
- Quantitative Results & Analysis
Table 1 summarizes overall performance across all tasks (average over 1,079 samples):
Table 1: M3CoTBench overall metrics (↑ better unless marked ↓)
| Model | F1 (%) | P (%) | R (%) | Acc_dir (%) | Acc_step (%) | I (%) | E (steps/s) | L (s) ↓ | C_path (%) |
|---|---|---|---|---|---|---|---|---|---|
| Open-source | | | | | | | | | |
| LLaVA-CoT | 49.8 | 54.1 | 46.2 | 40.1 | 36.8 | –3.3 | 0.06 | 1.56 | 77.0 |
| InternVL3.5-8B | 56.5 | 60.6 | 52.9 | 56.8 | 53.6 | –3.2 | 0.10 | 18.3 | 71.7 |
| InternVL3.5-30B | 59.4 | 62.2 | 56.9 | 63.8 | 57.6 | –6.2 | 0.03 | 16.7 | 76.3 |
| Qwen3-VL-Instruct-8B | 55.2 | 52.7 | 57.8 | 51.3 | 46.6 | –4.7 | 0.04 | 93.9 | 82.7 |
| Qwen3-VL-Instruct-30B | 59.2 | 56.1 | 62.5 | 54.6 | 51.4 | –3.2 | 0.03 | 35.6 | 83.0 |
| Qwen3-VL-Thinking-8B | 59.9 | 59.8 | 59.9 | 48.3 | 52.8 | +4.5 | 0.02 | 2.79 | 76.9 |
| Qwen3-VL-Thinking-30B | 62.2 | 63.3 | 61.0 | 51.9 | 55.5 | +3.6 | 0.02 | 1.15 | 76.0 |
| Closed-source | | | | | | | | | |
| GPT-4.1 | 60.8 | 58.3 | 63.4 | 56.8 | 58.0 | +1.2 | 0.17 | 5.08 | 81.3 |
| GPT-5 | 55.1 | 64.2 | 48.3 | 58.8 | 58.3 | –0.5 | 0.06 | 1.10 | 65.4 |
| Gemini 2.5 Pro | 66.1 | 62.5 | 70.1 | 60.2 | 60.1 | –0.2 | 0.10 | 1.52 | 82.0 |
| Claude-Sonnet-4.5 | 56.5 | 53.6 | 59.7 | 51.3 | 51.1 | –0.2 | 0.15 | 2.69 | 85.2 |
| Medical-specific | | | | | | | | | |
| LLaVA-Med (7B) | 30.5 | 36.3 | 26.3 | 29.4 | 29.3 | –0.1 | 0.35 | 3.22 | 72.7 |
| HuatuoGPT-Vision (7B) | 49.5 | 51.2 | 47.9 | 41.9 | 34.9 | –7.0 | 0.21 | 5.92 | 73.2 |
| HealthGPT (3.8B) | 32.6 | 47.3 | 24.8 | 44.1 | 42.0 | –2.1 | 0.06 | 15.4 | 67.7 |
| Lingshu-7B | 57.6 | 64.0 | 52.3 | 50.0 | 42.1 | –7.9 | 0.30 | 8.37 | 74.8 |
| Lingshu-32B | 59.2 | 65.7 | 53.8 | 51.8 | 45.0 | –6.8 | 0.21 | 10.9 | 71.5 |
| MedGemma-4B | 48.1 | 50.3 | 46.1 | 43.3 | 41.3 | –2.0 | 0.05 | 20.6 | 74.0 |
| MedGemma-27B | 51.0 | 48.3 | 53.8 | 46.1 | 45.9 | –0.2 | 0.03 | 23.7 | 82.6 |
Key findings:
- CoT does not uniformly improve accuracy in medical VQA; it can degrade performance when reasoning distracts from visual cues ($I$ is negative for most models in Table 1).
- Models pre-trained or prompted “for thinking” (e.g., Qwen3-VL-Thinking) gain modestly from CoT ($I$ = +4.5 for the 8B model and +3.6 for the 30B model).
- Closed-source models generally show stronger instruction compliance and balanced P/R/F1, leading to higher correctness and consistency.
- Medical-specific MLLMs often underperform in CoT alignment, suggesting that they emphasize domain knowledge over explicit stepwise rationales.
5.1 Error Analysis & Difficulty Breakdown
- Common error modes in CoT:
- Omission or misweighting of decisive diagnostic features.
- Vision–language grounding drift during verbalization.
- Accumulation of early mistakes through the chain.
Difficulty‐level trends (perceptual vs. reasoning tasks) indicate larger performance drops on high-level tasks (causal, action planning) under CoT prompting.
- Qualitative Examples
6.1 Case Study 1 (Cell Type Classification)
- Q: “True or False: The cell shown is a lymphocyte.”
- Gold CoT steps:
- 1) Hematology/cytology modality
- 2) Identify bilobed nucleus + eosinophilic granules
- 3) Conclude “eosinophil”
- Direct answer: False (correct)
- CoT answer: True (incorrect)
The GPT-generated CoT focused on generic lymphocyte features (scant cytoplasm, round nucleus) instead of the key eosinophilic granules, illustrating how CoT can amplify misinterpretation.
6.2 Case Study 2 (Ophthalmic Treatment Planning)
- Q: “Which is first-line treatment? A) Surgery B) Topical steroids + dilators C) No treatment D) Laser therapy.”
- Gold CoT: anterior uveitis → treat with steroids + dilators (B).
- MedGemma-27B CoT: misdiagnosed angle-closure glaucoma → selected D.
- Direct answer: B (correct); CoT answer: D (incorrect).
This case highlights how an erroneous intermediate inference can override an otherwise correct direct response.
- Limitations & Future Directions
7.1 Limitations
- Annotation: occasional dataset label inconsistencies and unrealistically precise disease labels.
- Subjectivity: variation in phrasing can affect automated matching.
- Evaluation bias: correctness and consistency are judged by LLMs (e.g., GPT-4o, Gemini 2.5 Pro) without human cross-validation.
- No inter-annotator agreement statistics, multiple runs, or confidence intervals are reported.
- Limited prompt ablations and hyperparameter explorations.
7.2 Future Directions
- Introduce human‐validated scoring for CoT steps and answers.
- Report inter‐annotator agreement and statistical significance (multiple runs, confidence intervals).
- Explore prompt engineering and adaptive CoT elicitation (e.g., selectively deeper reasoning on high-difficulty items).
- Extend to multi‐image clinical scenarios (e.g., series MRI).
- Incorporate feedback loops where expert revisions refine model reasoning during training.
M3CoTBench provides a structured framework to quantify not only “what” a medical MLLM predicts, but “how” it reasons, fostering development of transparent and clinically trustworthy AI systems.