Adversarial Evasion Attacks

Updated 20 January 2026

Adversarial evasion attacks are techniques that modify input data to mislead machine learning systems without noticeable changes to human observers.
They typically use gradient-based methods to create imperceptible perturbations that exploit vulnerabilities in model architectures.
Implementing defenses like adversarial training and detection mechanisms can significantly improve model robustness against these attacks.

Summary of “M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding”

Dataset Design

Size & Modalities
- 1,079 medical images with one QA pair each.
- Covers 24 examination types/modalities grouped into six major categories:
  - Ophthalmic (e.g., slit-lamp photography, fundus photography, OCT, OCTA, SLO, FFA)
  - Radiology (X-ray, CT, MRI, nuclear medicine)
  - Endoscopy (laparoscopy, colonoscopy, gastroscopy, capsule endoscopy, bronchoscopy, ENT endoscopy, fetoscopy)
  - Microscopy (histology, cytology, fluorescence microscopy)
  - Ultrasound (B-scan, obstetric, musculoskeletal, carotid, liver)
  - Surface inspections (dermoscopy, intraoral)
Task Difficulty Levels
- QA pairs span 13 task types ranging from low-level perception (e.g., image-quality assessment) to high-level clinical reasoning (e.g., causal inference, action planning).
- Although all 1,079 cases are used in aggregate evaluation, tasks are stratified by difficulty in analysis (perceptual vs. knowledge-based).
Annotation Protocol
- Formats: single‐choice, multiple‐choice, true/false, short answer.
- Inference-driven questions (e.g., “What might be the cause…?”) to introduce hierarchical difficulty.
- 2. AI‐human calibration: three MLLMs independently answer each QA pair; any disagreement triggers review by an expert physician; final sanity checks on all samples.
- 3. CoT annotation workflow:
  - a. MLLM‐based first draft of reasoning following a four‐step clinical structure:
    - 1) Identify modality/exam type
    - 2) Describe key visual features
    - 3) Draw diagnostic conclusion
    - 4) Additional analysis (e.g., treatment, causality).
  - b. Multi‐stage human‐AI review: student reviewer → automated multi‐model check → expert adjudication on flagged steps → consensus meetings → final expert read‐through.
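The AI-human calibration step can be illustrated with a small sketch. It assumes that "disagreement" means the three model answers are not identical after trimming whitespace and lowercasing; the summary does not say how answers are compared, and the function name and sample data below are hypothetical.

```python
from typing import Dict, List, Sequence


def needs_expert_review(model_answers: Sequence[str]) -> bool:
    """Return True when the independent model answers do not all agree.

    Assumption: agreement is exact string match after normalization.
    """
    normalized = {answer.strip().lower() for answer in model_answers}
    return len(normalized) > 1


# Hypothetical answers from three MLLMs for two QA pairs.
qa_answers: Dict[str, List[str]] = {
    "qa_001": ["B", "B", "b "],
    "qa_002": ["True", "False", "True"],
}

# QA pairs whose answers disagree are routed to an expert physician.
flagged = [qa_id for qa_id, answers in qa_answers.items()
           if needs_expert_review(answers)]
print(flagged)
```

Short-answer questions would likely need a looser comparison than exact matching, since equivalent answers can be phrased differently.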

Task Suite

M3CoTBench defines 13 tasks. For each task we summarize clinical goal, input, output, and relative difficulty.

| Task | Clinical Objective | Input Format | Output Format | Difficulty |
|---|---|---|---|---|
| 1. Examination Type | Recognize imaging modality | Image + prompt | Single‐choice | Low |
| 2. Image Quality | Judge diagnostic adequacy | Image | True/False or compare | Low |
| 3. Recognition | Identify structure, cell type, instrument | Image + prompt | Single/multi‐choice | Low–Mid |
| 4. Referring Recognition | Identify specified region | Image + region or mask | Single‐choice | Mid |
| 5. Localization | Localize lesions or anatomy | Image | Short answer (e.g., “upper lobe”) | Mid |
| 6. Counting | Count discrete objects (cells, tools, polyps) | Image | Short answer (integer) | Mid |
| 7. Diagnosis | Infer disease or abnormality | Image + prompt | Single/multi‐choice | Mid–High |
| 8. Grading | Assess severity/stage (e.g., DR level) | Image | Single‐choice | Mid–High |
| 9. Symptom | Identify clinical signs from image | Image + prompt | Short answer/string | High |
| 10. Clinical Action | Recommend next steps (e.g., treatment) | Image + multi‐choice | Single‐choice | High |
| 11. Prediction | Estimate progression or risk | Image + prompt | Short answer/string | High |
| 12. Function | Interpret organ/machine function | Image + prompt | Short answer/string | High |
| 13. Causal Reasoning | Identify etiology or cause | Image + multi‐choice | Multi‐choice | High |

Chain-of-Thought Evaluation Metrics

M3CoTBench evaluates CoT on four dimensions aligned with clinical reasoning requirements:

3.1 Correctness

Measures alignment between the model‐generated steps $\mathcal{R}^{(i)}$ and the gold steps $\mathcal{A}_k^{(i)}$ for sample $i$. Since multiple valid reasoning paths may exist, the reference $k^*$ with the largest overlap with $\mathcal{R}^{(i)}$ is used.
Average Precision and Recall:

$\text{AvgPrecision} \;=\; \frac{1}{N}\sum_{i=1}^N \frac{|\mathcal{R}^{(i)}\cap \mathcal{A}_{k^*}^{(i)}|}{|\mathcal{R}^{(i)}|}, \quad \text{AvgRecall} \;=\; \frac{1}{N}\sum_{i=1}^N \frac{|\mathcal{R}^{(i)}\cap \mathcal{A}_{k^*}^{(i)}|}{|\mathcal{A}_{k^*}^{(i)}|}.$
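A minimal sketch of these two averages follows. It assumes each reasoning path has already been reduced to a set of step identifiers, so that set intersection stands in for step matching; in the benchmark the matching is judged by LLMs, and the function names and toy data here are hypothetical.

```python
from typing import List, Set, Tuple


def best_reference(pred: Set[str], references: List[Set[str]]) -> Set[str]:
    """Pick the reference path k* with the largest overlap with the prediction."""
    return max(references, key=lambda ref: len(pred & ref))


def avg_precision_recall(
    preds: List[Set[str]], refs_per_sample: List[List[Set[str]]]
) -> Tuple[float, float]:
    """Average per-sample precision and recall against the best reference."""
    precisions, recalls = [], []
    for pred, references in zip(preds, refs_per_sample):
        ref = best_reference(pred, references)
        overlap = len(pred & ref)
        # Guard against empty paths to avoid division by zero.
        precisions.append(overlap / len(pred) if pred else 0.0)
        recalls.append(overlap / len(ref) if ref else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n


# Toy data: two samples, the first with two valid reference paths.
preds = [{"modality", "granules", "lymphocyte"}, {"modality", "finding"}]
refs = [
    [{"modality", "granules", "eosinophil"}, {"modality", "nucleus"}],
    [{"modality", "finding", "diagnosis", "treatment"}],
]
avg_p, avg_r = avg_precision_recall(preds, refs)
print(round(avg_p, 3), round(avg_r, 3))
```

The F1 column in Table 1 appears consistent with the harmonic mean of these two averages (for LLaVA-CoT, $2 \cdot 54.1 \cdot 46.2 / (54.1 + 46.2) \approx 49.8$), although the summary does not state that formula explicitly.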

3.2 Efficiency

Reflects how many correct reasoning steps a model produces per unit time, and the added latency due to CoT.

$E \;=\; \sum_{i=1}^N \frac{\bigl|\mathcal{R}^{(i)}\cap \mathcal{A}_{k^*}^{(i)}\bigr|}{T_{\mathrm{CoT}}}, \quad L \;=\; \frac{T_{\mathrm{CoT}} - T_{\mathrm{direct}}}{N}.$

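Here $T_{\mathrm{CoT}}$ and $T_{\mathrm{direct}}$ denote the inference time under the CoT and direct prompts, respectively. A larger $E$ (↑, higher is better) means more correct steps per second. A larger $L$ (↓, lower is better) means higher latency overhead per sample.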
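The two quantities can be computed as in the sketch below. It assumes $T_{\mathrm{CoT}}$ and $T_{\mathrm{direct}}$ are total wall-clock times over all $N$ samples, which is how the division by $N$ in $L$ reads; the function names and numbers are illustrative only.

```python
from typing import List


def efficiency(matched_steps: List[int], total_cot_seconds: float) -> float:
    """E: correct (matched) reasoning steps per second of CoT inference."""
    return sum(matched_steps) / total_cot_seconds


def latency_overhead(
    total_cot_seconds: float, total_direct_seconds: float, num_samples: int
) -> float:
    """L: extra seconds per sample spent on CoT compared with direct answering."""
    return (total_cot_seconds - total_direct_seconds) / num_samples


# Toy numbers: three samples with 2, 3, and 1 matched steps.
matched = [2, 3, 1]
E = efficiency(matched, total_cot_seconds=60.0)
L = latency_overhead(60.0, 15.0, num_samples=len(matched))
print(E, L)
```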

3.3 Impact

Quantifies whether CoT improves final‐answer accuracy.

$I \;=\; \mathrm{Acc}_{\mathrm{step}} - \mathrm{Acc}_{\mathrm{direct}},$

where $\mathrm{Acc}_{\mathrm{step}}$ is accuracy with CoT and $\mathrm{Acc}_{\mathrm{direct}}$ without CoT.
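As a small worked sketch, impact is the difference between two accuracies computed on the same items. Exact string matching of final answers is an assumption made for illustration, and the answers below are made up.

```python
from typing import List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of final answers that match the gold answers exactly."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def impact(cot_preds: List[str], direct_preds: List[str], gold: List[str]) -> float:
    """I = Acc_step - Acc_direct; negative values mean CoT hurt accuracy."""
    return accuracy(cot_preds, gold) - accuracy(direct_preds, gold)


gold = ["B", "False", "A", "C"]
direct = ["B", "False", "A", "D"]  # 3 of 4 correct
cot = ["D", "True", "A", "C"]      # 2 of 4 correct
I = impact(cot, direct, gold)
print(I)
```

In this toy case CoT flips two answers that the direct prompt had right and fixes one it had wrong, so the net impact is negative.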

3.4 Consistency

Measures how structurally stable reasoning paths are within the same task. Represent each path $P_i^{(t)}$ as an ordered sequence of step categories.

1) Select canonical path

$P^{(t)} \;=\;\arg\max_{\!P}\sum_{i=1}^N \mathrm{sim}\bigl(P,P_i^{(t)}\bigr), \quad \mathrm{sim}(P,Q)\;=\;\frac{|\mathrm{LCS}(P,Q)|}{\max(|P|,|Q|)}.$

2) Task‐level consistency

$C_{\mathrm{path}}^{(t)} \;=\;\frac{1}{N}\sum_{i=1}^N \mathrm{sim}\bigl(P^{(t)},P_i^{(t)}\bigr), \quad C_{\mathrm{path}} = \frac{1}{M}\sum_{t=1}^M C_{\mathrm{path}}^{(t)},\ M=13.$

Higher $C_{\mathrm{path}}\in[0,1]$ indicates more uniform reasoning structure.
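The consistency computation can be sketched as follows. One assumption is made explicit: the canonical path is searched only among the observed paths of a task, since the summary does not specify the candidate set for the $\arg\max$. Step-category names and helper names are hypothetical.

```python
from typing import List, Sequence


def lcs_length(p: Sequence[str], q: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(q) + 1)
    for a in p:
        curr = [0]
        for j, b in enumerate(q, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim(p: Sequence[str], q: Sequence[str]) -> float:
    """sim(P, Q) = |LCS(P, Q)| / max(|P|, |Q|); two empty paths count as identical."""
    longest = max(len(p), len(q))
    return lcs_length(p, q) / longest if longest else 1.0


def task_consistency(paths: List[Sequence[str]]) -> float:
    """C_path for one task, choosing the canonical path among observed paths."""
    canonical = max(paths, key=lambda cand: sum(sim(cand, p) for p in paths))
    return sum(sim(canonical, p) for p in paths) / len(paths)


# Toy task with three reasoning paths expressed as step categories.
paths = [
    ["modality", "features", "conclusion"],
    ["modality", "features", "conclusion"],
    ["modality", "conclusion"],
]
c_task = task_consistency(paths)
print(round(c_task, 3))
```

The overall $C_{\mathrm{path}}$ would then be the plain mean of `task_consistency` over the 13 tasks.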

Benchmarking Protocol

4.1 Experimental Setup

Preprocessing: Retain original image resolutions. Pairwise tasks (e.g., comparison) simply concatenate images.
Prompting:
- CoT prompt: “Please generate a step-by-step answer, including all intermediate reasoning steps, and provide the final answer at the end.”
- Direct prompt: “Please directly provide the final answer without any additional output.”
Inference settings: batch size = 1, temperature = 0.1.
Evaluation: answers and reasoning steps are judged using GPT-4o, Llama-3.3-70B-Instruct-Turbo, and Gemini 2.5 Pro.

4.2 Evaluated Models

| Category | Model (size) |
|---|---|
| Open-source MLLMs | LLaVA-CoT; InternVL3.5 (8B, 30B); Qwen3-VL-Instruct (8B, 30B); Qwen3-VL-Thinking (8B, 30B) |
| Closed-source MLLMs | GPT-4.1; GPT-5; Gemini 2.5 Pro; Claude-Sonnet-4.5 |
| Medical-specific MLLMs | LLaVA-Med (7B); HuatuoGPT-Vision (7B); HealthGPT (3.8B); Lingshu (7B, 32B); MedGemma (4B, 27B) |

Quantitative Results & Analysis

Table 1 summarizes overall performance across all tasks (average over 1,079 samples):

Table 1: M3CoTBench overall metrics (↑ better unless marked ↓)

| Model | F1 (%) | P (%) | R (%) | Acc_dir (%) | Acc_step (%) | I (%) | E (steps/s) | L (s) ↓ | C_path (%) |
|---|---|---|---|---|---|---|---|---|---|
| Open-source | | | | | | | | | |
| LLaVA-CoT | 49.8 | 54.1 | 46.2 | 40.1 | 36.8 | –3.3 | 0.06 | 1.56 | 77.0 |
| InternVL3.5-8B | 56.5 | 60.6 | 52.9 | 56.8 | 53.6 | –3.2 | 0.10 | 18.3 | 71.7 |
| InternVL3.5-30B | 59.4 | 62.2 | 56.9 | 63.8 | 57.6 | –6.2 | 0.03 | 16.7 | 76.3 |
| Qwen3-VL-Instruct-8B | 55.2 | 52.7 | 57.8 | 51.3 | 46.6 | –4.7 | 0.04 | 93.9 | 82.7 |
| Qwen3-VL-Instruct-30B | 59.2 | 56.1 | 62.5 | 54.6 | 51.4 | –3.2 | 0.03 | 35.6 | 83.0 |
| Qwen3-VL-Thinking-8B | 59.9 | 59.8 | 59.9 | 48.3 | 52.8 | +4.5 | 0.02 | 2.79 | 76.9 |
| Qwen3-VL-Thinking-30B | 62.2 | 63.3 | 61.0 | 51.9 | 55.5 | +3.6 | 0.02 | 1.15 | 76.0 |
| Closed-source | | | | | | | | | |
| GPT-4.1 | 60.8 | 58.3 | 63.4 | 56.8 | 58.0 | +1.2 | 0.17 | 5.08 | 81.3 |
| GPT-5 | 55.1 | 64.2 | 48.3 | 58.8 | 58.3 | –0.5 | 0.06 | 1.10 | 65.4 |
| Gemini 2.5 Pro | 66.1 | 62.5 | 70.1 | 60.2 | 60.1 | –0.2 | 0.10 | 1.52 | 82.0 |
| Claude-Sonnet-4.5 | 56.5 | 53.6 | 59.7 | 51.3 | 51.1 | –0.2 | 0.15 | 2.69 | 85.2 |
| Medical-specific | | | | | | | | | |
| LLaVA-Med (7B) | 30.5 | 36.3 | 26.3 | 29.4 | 29.3 | –0.1 | 0.35 | 3.22 | 72.7 |
| HuatuoGPT-Vision (7B) | 49.5 | 51.2 | 47.9 | 41.9 | 34.9 | –7.0 | 0.21 | 5.92 | 73.2 |
| HealthGPT (3.8B) | 32.6 | 47.3 | 24.8 | 44.1 | 42.0 | –2.1 | 0.06 | 15.4 | 67.7 |
| Lingshu-7B | 57.6 | 64.0 | 52.3 | 50.0 | 42.1 | –7.9 | 0.30 | 8.37 | 74.8 |
| Lingshu-32B | 59.2 | 65.7 | 53.8 | 51.8 | 45.0 | –6.8 | 0.21 | 10.9 | 71.5 |
| MedGemma-4B | 48.1 | 50.3 | 46.1 | 43.3 | 41.3 | –2.0 | 0.05 | 20.6 | 74.0 |
| MedGemma-27B | 51.0 | 48.3 | 53.8 | 46.1 | 45.9 | –0.2 | 0.03 | 23.7 | 82.6 |

Key findings:

CoT does not uniformly improve accuracy in medical VQA; it can degrade performance when reasoning distracts from visual cues ($I$ is negative for 15 of the 18 models in Table 1).
Models pre-trained or prompted “for thinking” (e.g., Qwen3-VL-Thinking) gain modestly from CoT ($I = +4.5$ and $+3.6$ for the 8B and 30B variants).
Closed-source models generally show stronger instruction compliance and balanced P/R/F1, leading to higher correctness and consistency.
Medical-specific MLLMs often underperform in CoT alignment, emphasizing domain knowledge over explicit stepwise rationales.

5.1 Error Analysis & Difficulty Breakdown

Common error modes in CoT:
1. Omission or misweighting of decisive diagnostic features.
2. Vision–language grounding drift during verbalization.
3. Accumulation of early mistakes through the chain.
Difficulty‐level trends (perceptual vs. reasoning tasks) indicate larger performance drops on high-level tasks (causal, action planning) under CoT prompting.

Qualitative Examples

6.1 Case Study 1 (Cell Type Classification)

Q: “True or False: The cell shown is a lymphocyte.”
Gold CoT steps:
1) Hematology/cytology modality
2) Identify bilobed nucleus + eosinophilic granules
3) Conclude “eosinophil”
Direct answer: False (correct)
CoT answer: True (incorrect)

The GPT-generated CoT focused on generic lymphocyte features (scant cytoplasm, round nucleus) instead of the decisive eosinophilic granules, illustrating how CoT can amplify a misinterpretation.

6.2 Case Study 2 (Ophthalmic Treatment Planning)

Q: “Which is first-line treatment? A) Surgery B) Topical steroids + dilators C) No treatment D) Laser therapy.”
Gold CoT: anterior uveitis → treat with steroids + dilators (B).
MedGemma-27B CoT: misdiagnosed angle-closure glaucoma → selected D.
Direct answer: B (correct); CoT answer: D.

This case highlights how an erroneous intermediate inference can override an otherwise correct direct response.

Limitations & Future Directions

7.1 Limitations

Annotation: occasional dataset label inconsistencies and unrealistically precise disease labels.
Subjectivity: variation in phrasing can affect automated matching.
Evaluation bias: correctness and consistency judged by LLMs (GPT-4o, Gemini 2.5) without human cross-validation.
No inter-annotator agreement statistics or multiple runs / confidence intervals.
Limited prompt ablations and hyperparameter explorations.

7.2 Future Directions

Introduce human‐validated scoring for CoT steps and answers.
Report inter‐annotator agreement and statistical significance (multiple runs, confidence intervals).
Explore prompt engineering and adaptive CoT elicitation (e.g., selectively deeper reasoning on high-difficulty items).
Extend to multi‐image clinical scenarios (e.g., series MRI).
Incorporate feedback loops where expert revisions refine model reasoning during training.

M3CoTBench provides a structured framework to quantify not only “what” a medical MLLM predicts, but “how” it reasons, fostering development of transparent and clinically trustworthy AI systems.
