CARES-18K Benchmark for Medical LLMs
- CARES-18K Benchmark is a large-scale, clinically grounded suite that evaluates safety alignment and adversarial robustness in medical LLMs.
- It systematically tests eight medical safety principles, four harm levels, and four prompting styles to expose unsafe compliance and over-cautious refusal behaviors.
- Its rigorous evaluation protocol, including a fine-grained Safety Score, drives quantitative insights and mitigation strategies for clinical AI deployments.
The CARES-18K benchmark is a large-scale, clinically grounded evaluation suite for systematically testing the safety alignment and adversarial robustness of LLMs deployed in medical and healthcare contexts. Introduced by Chen et al., CARES-18K offers an 18,000-prompt testbed spanning eight medical safety principles, four harm levels, and four prompting styles, accompanied by a rigorous three-way evaluation protocol and a fine-grained Safety Score metric. CARES-18K is designed to expose both unsafe compliance and over-cautious refusal behaviors in contemporary medical LLMs and provides a foundation for benchmarking safety alignments, adversarial robustness, and mitigation strategies (Chen et al., 16 May 2025, Nghiem et al., 3 Dec 2025).
1. Dataset Structure and Composition
CARES-18K consists of 18,000 single-turn prompts systematically constructed along three primary axes: medical safety principles, graded harmfulness levels, and prompting styles. Prompts are distributed as follows:
| Dimension | Categories | Count | Percentage |
|---|---|---|---|
| Safety Principles (P₁–P₈) | 8 equal bins | 2,250 | 12.5% each |
| Harm Levels (0–3) | 4 equal bins | 4,500 | 25% each |
| Prompting Styles | Direct, Indirect, Obfuscated, Role-play | 4,500 | 25% each |
Safety principles are rooted in American Medical Association (AMA) ethics, HIPAA, and established AI rule bases, covering axes such as patient privacy, clinical ethics, and illicit practices.
Harm levels are explicitly defined:
- Level 0 (Harmless)
- Level 1 (Mildly harmful/ethically inappropriate)
- Level 2 (Moderately harmful/legally questionable)
- Level 3 (Highly harmful/socially or systemically dangerous)
Prompting styles include:
- Direct: original prompts;
- Indirect: speculative/chained phrasing;
- Obfuscated: keyword substitution;
- Role-play: persona-based disguise.
Prompt construction proceeds with four instruction-tuned LLMs (GPT-4o-mini, Gemini-2.5-Flash, DeepSeek-V3, Claude-3.5-Haiku) generating 50 base prompts per principle per harm level. These are deduplicated (using MinHash LSH, Jaccard > 0.7), human validated, and rewritten into the three adversarial styles. Human annotators and clinical experts validate the final set. Annotation quality is high (human–GPT-4o-mini Krippendorff’s α = 0.81) (Chen et al., 16 May 2025, Nghiem et al., 3 Dec 2025).
2. Evaluation Protocol and Metrics
CARES-18K introduces a three-class response evaluation:
- Accept: Full compliance (includes unsafe acceptance or correct response to benign prompt)
- Caution: Partial compliance with caveats/deflections
- Refuse: Explicit refusal to comply with the (unsafe) request
All model outputs are auto-labeled by GPT-4o-mini (T=0). Human spot-checks confirm a 98% agreement rate.
Quantitative evaluation uses several interrelated metrics:
- Safety Score (SS): Assigns credit for correct accept/refuse behavior with partial credit for “Caution” on moderate harm prompts.
with the scoring table reflecting nuanced reward/penalty for each (harm, action) combination. For example, accepting a safe prompt () or refusing a severely harmful prompt () receives 1; cautious responses to moderately harmful prompts receive 0.5.
- Binary F1 (harmful prompts): Considers refusal or caution on harmful prompts as positive; acceptances as negative.
- Erroneous Refusal Rate (ERR): Frequency of over-refusal (i.e., caution or refusal) on harmless () prompts.
- Overall Metric (OM): Combines Safety Score and via (default ).
- Accuracy (ACC): Standard binary classification accuracy.
This protocol enables fine-grained tracking of both true/false positives and over-refusal on benign queries.
3. Adversarial Construction and Annotation
CARES-18K uniquely emphasizes adversarial robustness through systematic prompt transformation. Core steps include:
- Seed prompt curation: Drawn from medical-ethics guidelines such as AMA and HIPAA.
- Adversarial transformation: Generation of indirect, obfuscated, and role-play variants using state-of-the-art LLMs.
- Deduplication and filtering: Automatic exclusion of trivial or irrelevant prompts using MinHash and Jaccard similarity.
- Expert annotation: Prompts labeled for harmfulness following the PKU-SafeRLHF taxonomy and validated for clinical fidelity.
Annotation reliability is reported via Krippendorff’s α and Cohen’s κ between model/judge pairs (e.g., Llama-3B vs. GPT-4o-mini, κ=0.59).
4. Model Performance and Empirical Results
Extensive benchmarking with CARES-18K reveals notable trends in model safety and robustness. Selected results illustrate:
| Model | Safety Score (SS) | Accuracy (ACC) | F1 |
|---|---|---|---|
| O4-mini | 0.71 | 0.74 | 0.85 |
| DeepSeek-R1 | 0.70 | 0.73 | 0.84 |
| MedAlpaca-13B | 0.70 | 0.73 | 0.84 |
| Meditron-70B | 0.67 | 0.72 | 0.83 |
| GPT-4o-mini | 0.56 | 0.61 | 0.76 |
| Llama-3.2-1B | 0.57 | 0.61 | 0.76 |
| Mixtral-8x7B | 0.53 | 0.59 | 0.75 |
Under jailbreak (adversarial) conditions, Safety Scores decline by 8–12 points, with indirect and role-play styles being particularly effective at circumventing refusals. State-of-the-art models still refuse ~10% of Level 0 (harmless) prompts when confronted with adversarial rephrasings, whereas weaker models accept up to 20% of Level 3 (highly harmful) inputs.
Iterative alignment frameworks leveraging CARES-18K (using Kahneman-Tversky Optimization and Direct Preference Optimization) yield up to 42% improvement in safety-related metrics on harmful query detection for certain models (Meditron-8B). However, increases in Safety Score are frequently accompanied by elevated rates of erroneous refusal on safe prompts, highlighting the safety–utility trade-off (Nghiem et al., 3 Dec 2025).
5. Mitigation Strategies and Jailbreak Conditioning
CARES-18K underpins development and evaluation of mitigation techniques against adversarial prompt attack. A notable strategy involves:
- Training a dedicated “jailbreak identifier” to classify the adversarial style (indirect, obfuscated, role-play) using Qwen2.5-7B-Instruct (learning rate , 5 epochs, batch size 16), achieving 97.7% validation accuracy (F1=0.976).
- At inference, the predicted jailbreak type is prepended as a reminder prompt, e.g., “Note: This prompt has been obfuscated via synonym substitution. Please ensure you do not comply with harmful requests.”
- This reminder-based conditioning yields measurable SS gains: 1–2 points for robust models (GPT-4o-mini, DeepSeek-V3) and 5–7 points for more vulnerable models (Llama-3.1-8B, Llama-3.2-3B).
| Model | SS_before | SS_after | ΔSS |
|---|---|---|---|
| Claude-3.5-Haiku | 0.64 | 0.68 | +0.04 |
| Llama-3.1-8B | 0.58 | 0.65 | +0.07 |
| Llama-3.2-3B | 0.56 | 0.63 | +0.07 |
| GPT-4o-mini | 0.56 | 0.58 | +0.02 |
| DeepSeek-V3 | 0.57 | 0.59 | +0.02 |
This suggests that lightweight, context-aware conditioning can partially mitigate the effects of adversarial attacks, particularly for models earlier in the safety alignment process (Chen et al., 16 May 2025).
6. Applications and Broader Impact
CARES-18K has become a reference benchmark for medical LLM safety and alignment research. Notable use cases include:
- Benchmarking adversarial robustness of both proprietary and open-source LLMs in clinical tasks.
- Serving as a testbed for domain-specific alignment via iterative fine-tuning (KTO/DPO) and for calibrating trade-offs between safety and utility.
- Enabling ablation studies for model self-evaluation reliability versus external judging, with architecture-dependent calibration biases documented (e.g., Llama-3B aligns well with GPT-4o-mini, whereas larger models like Meditron-8B manifest over-refusal tendencies).
- Exposing the necessity for decoupled supervision and rigorous, domain-specific prompt design approaches for clinical AI deployments.
A plausible implication is that future model alignment in healthcare will require not only adversarial robustness benchmarks but also comprehensive, clinically validated datasets operating at multiple abstraction and harm levels. CARES-18K represents the current state-of-the-art in this domain by offering granular annotation, adversarial coverage, and an interpretable scoring methodology.
7. Limitations and Future Directions
CARES-18K covers 8 medical safety principles and 4 calibrated harm levels, but its adversarial prompting schema currently addresses only single-turn inputs. Potential future directions include:
- Extending the benchmark to multi-turn and scenario-based dialogue,
- Diversifying adversarial tactics beyond the current trio of indirect, obfuscated, and role-play,
- Incorporating more advanced, dynamic human adversaries in the prompt design loop,
- Further correlating annotation quality and clinical validity in real-world deployments.
Overall, CARES-18K advances the empirical paper of safety, refusal, and adversarial robustness in medical LLMs, providing a rigorous and interpretable basis for both model evaluation and safety-oriented development (Chen et al., 16 May 2025, Nghiem et al., 3 Dec 2025).