Adversarial Evasion Attacks
- Adversarial evasion attacks are techniques that modify input data to mislead machine learning systems without noticeable changes to human observers.
- They typically use gradient-based methods to create imperceptible perturbations that exploit vulnerabilities in model architectures.
- Implementing defenses like adversarial training and detection mechanisms can significantly improve model robustness against these attacks.
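As a concrete illustration of the gradient-based approach described above, here is a minimal FGSM-style sketch in PyTorch; the classifier, input range, and epsilon budget are placeholders, not a specific attacked system.

```python
# Minimal FGSM sketch: perturb inputs in the direction that increases the loss.
# Assumes a trained PyTorch classifier `model`, inputs `x` in [0, 1], labels `y`.
import torch
import torch.nn.functional as F

def fgsm_attack(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                epsilon: float = 0.03) -> torch.Tensor:
    """Return adversarial examples within an L-infinity budget of `epsilon`."""
    model.eval()
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # The sign of the input gradient gives the per-pixel direction of steepest loss increase.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

Adversarial training, mentioned above as a defense, typically mixes such perturbed examples back into the training batches.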
Summary of “M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding”
- Dataset Design
- Size & Modalities
- 1,079 medical images with one QA pair each.
- Covers 24 examination types/modalities grouped into six major categories:
- Ophthalmic (e.g., slit-lamp photography, fundus photography, OCT, OCTA, SLO, FFA)
- Radiology (X-ray, CT, MRI, nuclear medicine)
- Endoscopy (laparoscopy, colonoscopy, gastroscopy, capsule endoscopy, bronchoscopy, ENT endoscopy, fetoscopy)
- Microscopy (histology, cytology, fluorescence microscopy)
- Ultrasound (B-scan, obstetric, musculoskeletal, carotid, liver)
- Surface inspections (dermoscopy, intraoral)
- Task Difficulty Levels
- QA pairs span 13 task types ranging from low-level perception (e.g., image-quality assessment) to high-level clinical reasoning (e.g., causal inference, action planning).
- Although all 1,079 cases contribute to the aggregate evaluation, the analysis stratifies tasks by difficulty (perceptual vs. knowledge-based).
- Annotation Protocol
- 1. Question design:
  - Formats: single-choice, multiple-choice, true/false, short answer.
  - Inference-driven questions (e.g., “What might be the cause…?”) introduce hierarchical difficulty.
- 2. AI-human calibration: three MLLMs independently answer each pair; any disagreement triggers review by an expert physician; final sanity checks are run on all samples.
- 3. CoT annotation workflow:
- a. MLLM‐based first draft of reasoning following a four‐step clinical structure:
- 1) Identify modality/exam type
- 2) Describe key visual features
- 3) Draw diagnostic conclusion
- 4) Additional analysis (e.g., treatment, causality).
- b. Multi‐stage human‐AI review: student reviewer → automated multi‐model check → expert adjudication on flagged steps → consensus meetings → final expert read‐through.
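The summary does not specify the dataset's storage format; the hypothetical Python sketch below only illustrates how one annotated sample with the four-step CoT structure might be represented (all field names are assumptions).

```python
# Hypothetical record layout for one annotated sample; field names are illustrative
# and not the benchmark's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CoTSample:
    image_path: str        # medical image (one of the 24 examination types)
    task: str              # one of the 13 task types
    question: str          # single/multi-choice, true/false, or short-answer prompt
    answer: str            # gold final answer
    cot_steps: List[str] = field(default_factory=list)  # four-step clinical reasoning

sample = CoTSample(
    image_path="images/example_cytology.png",
    task="Recognition",
    question="True or False: The cell shown is a lymphocyte.",
    answer="False",
    cot_steps=[
        "Identify modality/exam type",
        "Describe key visual features",
        "Draw diagnostic conclusion",
        "Additional analysis (e.g., treatment, causality)",
    ],
)
```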
- Task Suite
  - M3CoTBench defines 13 tasks. The table below summarizes each task's clinical objective, input format, output format, and relative difficulty.
| Task | Clinical Objective | Input Format | Output Format | Difficulty |
|---|---|---|---|---|
| 1. Examination Type | Recognize imaging modality | Image + prompt | Single‐choice | Low |
| 2. Image Quality | Judge diagnostic adequacy | Image | True/False or compare | Low |
| 3. Recognition | Identify structure, cell type, instrument | Image + prompt | Single/multi‐choice | Low–Mid |
| 4. Referring Recognition | Identify specified region | Image + region or mask | Single‐choice | Mid |
| 5. Localization | Localize lesions or anatomy | Image | Short answer (e.g., “upper lobe”) | Mid |
| 6. Counting | Count discrete objects (cells, tools, polyps) | Image | Short answer (integer) | Mid |
| 7. Diagnosis | Infer disease or abnormality | Image + prompt | Single/multi‐choice | Mid–High |
| 8. Grading | Assess severity/stage (e.g., DR level) | Image | Single‐choice | Mid–High |
| 9. Symptom | Identify clinical signs from image | Image + prompt | Short answer/string | High |
| 10. Clinical Action | Recommend next steps (e.g., treatment) | Image + multi‐choice | Single‐choice | High |
| 11. Prediction | Estimate progression or risk | Image + prompt | Short answer/string | High |
| 12. Function | Interpret organ/machine function | Image + prompt | Short answer/string | High |
| 13. Causal Reasoning | Identify etiology or cause | Image + multi‐choice | Multi‐choice | High |
- Chain-of-Thought Evaluation Metrics
  - M3CoTBench evaluates CoT along four dimensions aligned with clinical reasoning requirements:
3.1 Correctness
- Measures alignment between the model-generated reasoning steps $S$ and the gold reference steps $G$. Since multiple valid reasoning paths exist, the reference path that maximizes overlap with the model output is chosen.
- Precision, recall, and F1 are computed over matched steps and averaged across samples:
  $P = \frac{|S \cap G|}{|S|}, \quad R = \frac{|S \cap G|}{|G|}, \quad F_1 = \frac{2PR}{P + R}$
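A minimal sketch of how these scores might be computed, assuming an external judge (e.g., one of the LLM evaluators in Sec. 4.1) has already counted matched steps per sample; the macro-averaging choice is an assumption, not the paper's stated implementation.

```python
# Step-level correctness: per-sample precision/recall/F1, macro-averaged over samples.
from typing import List, Tuple

def step_correctness(matched: int, n_generated: int, n_gold: int) -> Tuple[float, float, float]:
    """P/R/F1 over reasoning steps for one sample, given judged match counts."""
    p = matched / n_generated if n_generated else 0.0
    r = matched / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def average_correctness(samples: List[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Macro-average P/R/F1 over (matched, generated, gold) counts."""
    scores = [step_correctness(*s) for s in samples]
    n = len(scores)
    return (sum(s[0] for s in scores) / n,
            sum(s[1] for s in scores) / n,
            sum(s[2] for s in scores) / n)

# Example: two samples with (matched, generated, gold) step counts.
print(average_correctness([(3, 4, 5), (2, 2, 4)]))
```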
3.2 Efficiency
- Reflects how many correct reasoning steps a model produces per unit time, and the added latency introduced by CoT:
  $E = \frac{N_{\text{correct steps}}}{T_{\text{CoT}}}, \qquad L = T_{\text{CoT}} - T_{\text{direct}}$
- Larger $E$ (↑) indicates more accurate steps per second; larger $L$ (↓) indicates higher latency overhead.
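A small sketch of these two quantities, assuming wall-clock inference times in seconds and a judged count of correct steps; treating $L$ as the CoT-minus-direct time difference follows the "added latency" wording above and is otherwise an assumption.

```python
def efficiency_metrics(n_correct_steps: int, t_cot: float, t_direct: float):
    """Return (E, L): correct steps per second under CoT, and CoT latency overhead."""
    e = n_correct_steps / t_cot if t_cot > 0 else 0.0
    latency_overhead = t_cot - t_direct
    return e, latency_overhead

# Example: 4 judged-correct steps in 20 s of CoT inference vs. a 2 s direct answer.
print(efficiency_metrics(4, 20.0, 2.0))  # -> (0.2, 18.0)
```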
3.3 Impact
- Quantifies whether CoT improves final-answer accuracy:
  $I = \mathrm{Acc}_{\text{step}} - \mathrm{Acc}_{\text{dir}}$
  where $\mathrm{Acc}_{\text{step}}$ is accuracy with CoT and $\mathrm{Acc}_{\text{dir}}$ is accuracy without it (e.g., LLaVA-CoT in Table 1: $I = 36.8 - 40.1 = -3.3$).
3.4 Consistency
- Measures how structurally stable reasoning paths are within the same task. Each path is represented as an ordered sequence of step categories $p_i$.
- 1) Select a canonical path $p^{*}$ for the task.
- 2) Task-level consistency: $C_{\text{path}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{sim}(p_i, p^{*})$, the average similarity of the task's $N$ paths to the canonical path.
- Higher $C_{\text{path}}$ indicates a more uniform reasoning structure.
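A minimal sketch of a path-consistency score under these definitions; the choice of canonical path (most representative by average similarity) and the similarity measure (difflib ratio) are illustrative assumptions rather than the benchmark's exact formulas.

```python
# Path consistency: average similarity of each reasoning path (a sequence of step
# categories) to a canonical path chosen for the task.
from difflib import SequenceMatcher
from typing import List

def sim(a: List[str], b: List[str]) -> float:
    """Similarity between two ordered sequences of step categories (0..1)."""
    return SequenceMatcher(None, a, b).ratio()

def path_consistency(paths: List[List[str]]) -> float:
    """C_path for one task: mean similarity of every path to the canonical path."""
    canonical = max(paths, key=lambda p: sum(sim(p, q) for q in paths))
    return sum(sim(p, canonical) for p in paths) / len(paths)

paths = [
    ["modality", "findings", "diagnosis", "plan"],
    ["modality", "findings", "diagnosis"],
    ["findings", "diagnosis", "plan"],
]
print(round(path_consistency(paths), 2))
```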
- Benchmarking Protocol
4.1 Experimental Setup
- Preprocessing: Retain original image resolutions. Pairwise tasks (e.g., comparison) simply concatenate images.
- Prompting:
- CoT prompt: “Please generate a step-by-step answer, including all intermediate reasoning steps, and provide the final answer at the end.”
- Direct prompt: “Please directly provide the final answer without any additional output.”
- Inference settings: batch size = 1, temperature = 0.1.
- Evaluation of answers and reasoning steps uses GPT-4o, Llama-3.3-70B-Instruct-Turbo, and Gemini-2.5 Pro.
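A minimal sketch of how the two prompting regimes might be issued against an OpenAI-compatible chat endpoint with the stated temperature; the model name, image encoding, and client usage are assumptions, since the paper's actual inference harness is not described here.

```python
# CoT vs. direct prompting sketch for an image question (OpenAI-compatible endpoint).
import base64
from openai import OpenAI

COT_PROMPT = ("Please generate a step-by-step answer, including all intermediate "
              "reasoning steps, and provide the final answer at the end.")
DIRECT_PROMPT = "Please directly provide the final answer without any additional output."

client = OpenAI()

def ask(image_path: str, question: str, cot: bool) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",        # placeholder model name
        temperature=0.1,       # inference setting reported in the benchmark
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": f"{question}\n{COT_PROMPT if cot else DIRECT_PROMPT}"},
            ],
        }],
    )
    return response.choices[0].message.content
```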
4.2 Evaluated Models

| Category | Model (size) |
|---|---|
| Open-source MLLMs | LLaVA-CoT; InternVL3.5 (8B, 30B); Qwen3-VL-Instruct (8B, 30B); Qwen3-VL-Thinking (8B, 30B) |
| Closed-source MLLMs | GPT-4.1; GPT-5; Gemini 2.5 Pro; Claude-Sonnet-4.5 |
| Medical-specific MLLMs | LLaVA-Med (7B); HuatuoGPT-Vision (7B); HealthGPT (3.8B); Lingshu (7B, 32B); MedGemma (4B, 27B) |
- Quantitative Results & Analysis
  - Table 1 summarizes overall performance across all tasks (averaged over the 1,079 samples):
Table 1: M3CoTBench overall metrics (↑ better unless marked ↓)

| Model | F1 (%) | P (%) | R (%) | Acc_dir (%) | Acc_step (%) | I (%) | E (steps/s) | L (s) ↓ | C_path (%) |
|---|---|---|---|---|---|---|---|---|---|
| **Open-source** | | | | | | | | | |
| LLaVA-CoT | 49.8 | 54.1 | 46.2 | 40.1 | 36.8 | –3.3 | 0.06 | 1.56 | 77.0 |
| InternVL3.5-8B | 56.5 | 60.6 | 52.9 | 56.8 | 53.6 | –3.2 | 0.10 | 18.3 | 71.7 |
| InternVL3.5-30B | 59.4 | 62.2 | 56.9 | 63.8 | 57.6 | –6.2 | 0.03 | 16.7 | 76.3 |
| Qwen3-VL-Instruct-8B | 55.2 | 52.7 | 57.8 | 51.3 | 46.6 | –4.7 | 0.04 | 93.9 | 82.7 |
| Qwen3-VL-Instruct-30B | 59.2 | 56.1 | 62.5 | 54.6 | 51.4 | –3.2 | 0.03 | 35.6 | 83.0 |
| Qwen3-VL-Thinking-8B | 59.9 | 59.8 | 59.9 | 48.3 | 52.8 | +4.5 | 0.02 | 2.79 | 76.9 |
| Qwen3-VL-Thinking-30B | 62.2 | 63.3 | 61.0 | 51.9 | 55.5 | +3.6 | 0.02 | 1.15 | 76.0 |
| **Closed-source** | | | | | | | | | |
| GPT-4.1 | 60.8 | 58.3 | 63.4 | 56.8 | 58.0 | +1.2 | 0.17 | 5.08 | 81.3 |
| GPT-5 | 55.1 | 64.2 | 48.3 | 58.8 | 58.3 | –0.5 | 0.06 | 1.10 | 65.4 |
| Gemini 2.5 Pro | 66.1 | 62.5 | 70.1 | 60.2 | 60.1 | –0.2 | 0.10 | 1.52 | 82.0 |
| Claude-Sonnet-4.5 | 56.5 | 53.6 | 59.7 | 51.3 | 51.1 | –0.2 | 0.15 | 2.69 | 85.2 |
| **Medical-specific** | | | | | | | | | |
| LLaVA-Med (7B) | 30.5 | 36.3 | 26.3 | 29.4 | 29.3 | –0.1 | 0.35 | 3.22 | 72.7 |
| HuatuoGPT-Vision (7B) | 49.5 | 51.2 | 47.9 | 41.9 | 34.9 | –7.0 | 0.21 | 5.92 | 73.2 |
| HealthGPT (3.8B) | 32.6 | 47.3 | 24.8 | 44.1 | 42.0 | –2.1 | 0.06 | 15.4 | 67.7 |
| Lingshu-7B | 57.6 | 64.0 | 52.3 | 50.0 | 42.1 | –7.9 | 0.30 | 8.37 | 74.8 |
| Lingshu-32B | 59.2 | 65.7 | 53.8 | 51.8 | 45.0 | –6.8 | 0.21 | 10.9 | 71.5 |
| MedGemma-4B | 48.1 | 50.3 | 46.1 | 43.3 | 41.3 | –2.0 | 0.05 | 20.6 | 74.0 |
| MedGemma-27B | 51.0 | 48.3 | 53.8 | 46.1 | 45.9 | –0.2 | 0.03 | 23.7 | 82.6 |
Key findings:
- CoT does not uniformly improve accuracy in medical VQA; it can degrade performance when reasoning distracts from visual cues ($I$ is negative for most evaluated models).
- Models pre-trained or prompted “for thinking” (e.g., Qwen3-VL-Thinking) gain modestly from CoT ($I$ = +4.5 for the 8B and +3.6 for the 30B variant).
- Closed-source models generally show stronger instruction compliance and balanced P/R/F1, leading to higher correctness and consistency.
- Medical-specific MLLMs often underperform in CoT alignment, emphasizing domain knowledge over explicit stepwise rationales.
5.1 Error Analysis & Difficulty Breakdown
- Common error modes in CoT:
- Omission or misweighting of decisive diagnostic features.
- Vision–language grounding drift during verbalization.
- Accumulation of early mistakes through the chain.
- Difficulty-level trends (perceptual vs. reasoning tasks) indicate larger performance drops on high-level tasks (causal reasoning, action planning) under CoT prompting.
- Qualitative Examples
6.1 Case Study 1 (Cell Type Classification)
- Q: “True or False: The cell shown is a lymphocyte.”
- Gold CoT steps: 1) hematology/cytology modality; 2) identify bilobed nucleus + eosinophilic granules; 3) conclude “eosinophil”.
- Direct answer: False (correct); CoT answer: True (incorrect).
The GPT-generated CoT focused on generic lymphocyte features (scant cytoplasm, round nucleus) instead of the decisive cytoplasmic granules, illustrating how CoT can amplify visual misinterpretation.
6.2 Case Study 2 (Ophthalmic Treatment Planning)
- Q: “Which is first-line treatment? A) Surgery B) Topical steroids + dilators C) No treatment D) Laser therapy.”
- Gold CoT: anterior uveitis → treat with topical steroids + dilators (B).
- MedGemma-27B CoT: misdiagnosed angle-closure glaucoma → selected D.
- Direct answer: B (correct); CoT answer: D (incorrect).
- Highlights how an erroneous intermediate inference can override an otherwise correct direct response.
- Limitations & Future Directions
7.1 Limitations
- Annotation: occasional dataset label inconsistencies and unrealistically precise disease labels.
- Subjectivity: variation in phrasing can affect automated matching.
- Evaluation bias: correctness and consistency judged by LLMs (GPT-4o, Gemini 2.5) without human cross-validation.
- No inter-annotator agreement statistics or multiple runs / confidence intervals.
- Limited prompt ablations and hyperparameter explorations.
7.2 Future Directions
- Introduce human‐validated scoring for CoT steps and answers.
- Report inter‐annotator agreement and statistical significance (multiple runs, confidence intervals).
- Explore prompt engineering and adaptive CoT elicitation (e.g., selectively deeper reasoning on high-difficulty items).
- Extend to multi‐image clinical scenarios (e.g., series MRI).
- Incorporate feedback loops where expert revisions refine model reasoning during training.
M3CoTBench provides a structured framework to quantify not only “what” a medical MLLM predicts, but “how” it reasons, fostering development of transparent and clinically trustworthy AI systems.