JMedEthicBench: Evaluating Medical AI Ethics in Japan
- JMedEthicBench is a multi-turn, adversarial conversational benchmark designed to assess medical AI safety and ethical compliance in Japanese healthcare settings.
- It employs automated jailbreak strategies and multi-agent dialogue generation to reveal safety degradation and alignment failures in LLMs over progressive turns.
- The benchmark features over 54,000 adversarial dialogues with dual-LLM scoring, providing robust metrics for both commercial and medical-specialized models.
JMedEthicBench is a multi-turn, adversarial conversational benchmark developed for systematically evaluating the medical safety of LLMs in Japanese healthcare contexts. It is the first benchmark to comprehensively address medical ethics in Japanese through realistic, multi-turn dialogue scenarios grounded in the explicit regulatory framework of the Japan Medical Association (JMA). JMedEthicBench focuses on testing models under adversarial pressure using jailbreak strategies that mimic real-world attempts to subvert model safeguards in clinical conversational settings. The design, metrics, and findings of JMedEthicBench illuminate the risks associated with domain-specific LLM fine-tuning, highlight the unique threat surface introduced by multi-turn interactions, and establish a reference protocol for robustness-focused assessment of medical AI (Liu et al., 4 Jan 2026).
1. Rationale and Theoretical Foundations
The motivation for JMedEthicBench emerges from recognized deficiencies in prior medical safety benchmarks, which are predominantly English-centric and restricted to single-turn, isolated prompts. Real clinical consultations in Japan require navigation of complex, multi-step exchanges where harmful behavior may be elicited over time, exploiting the limitations of static safety filters. The Japanese medical and legal landscape is governed by the 67-principle “Principles of Medical Ethics” issued by the JMA, with detailed operational rules shaping professional behavior on autonomy, informed consent, end-of-life care, accountability, and cultural sensitivity. Effective benchmarking for Japanese healthcare thus demands both (1) precise alignment with JMA rules and (2) modeling of incremental, context-dependent pressure that patients or adversaries may exert in extended dialogue (Liu et al., 4 Jan 2026).
2. Benchmark Generation: Data Collection and Adversarial Strategies
2.1 Principle-Grounded Scenario Creation
For each of the 67 JMA guidelines, harmful single-turn questions were generated using state-of-the-art LLMs (Claude-3.7, Gemini-2.5, GPT-O3, DeepSeek-R1, QWQ-32B) combined with scenario engineering templates. This process yielded an initial candidate pool of 3,350 Japanese-language adversarial questions. Humanlike model consensus (explicit refusal markers from at least two out of five models) filtered these to 1,935 validated, high-fidelity harmful prompts. Each question is labeled to its supporting JMA principle (one of eight high-level rule categories).
2.2 Automatic Jailbreak Strategy Discovery
JMedEthicBench employs a multi-agent, LLM-driven methodology adapted from AutoDAN-Turbo to programmatically discover and instantiate jailbreak strategies. An attacker LLM iteratively crafts jailbreak prompts for a target LLM, while a scorer LLM evaluates the success of each attempt. Generalization through summarizer LLMs over multiple rounds results in strategy template distillation. Ultimately, seven distinct jailbreak strategies—such as Gray-Zone Ethnographic Simulation and Fiction-Layered Authority Simulation—are formalized and used to generate adversarial multi-turn scenarios.
2.3 Large-Scale Multi-Turn Dialogue Generation
Each validated harmful question, paired with a randomly selected jailbreak strategy, is used to bootstrap multi-turn adversarial conversations via a four-agent system (Gemini-2.5-Pro, Claude-3.7, QWQ-32B, DeepSeek-R1). Dialogues typically comprise three turns (user escalation, assistant response, strong user reprompt), reflecting the progressive challenge/response structure of real-world adversarial encounters. After filtering for auto-rejections and errors, the final dataset consists of 54,180 adversarial multi-turn dialogues and a professionally validated test set of 2,345 instances, evenly stratified across principles and strategies (Liu et al., 4 Jan 2026).
3. Evaluation Protocol and Metrics
3.1 Dual-LLM Scoring Scheme
Each model response at every dialogue turn is scored by two independent LLM-based scorers (DeepSeek-R1 and GPT-4o-mini). The scoring prompt provides the full turn-wise dialogue context, a rubric directly derived from the 67 JMA guidelines, and instructions to emit a numerical score . The final turn-wise safety score is the arithmetic mean of the two scorer outputs.
3.2 Aggregate and Statistical Analysis
Aggregate metrics include per-dialogue average safety (), median and mean safety across all dialogues and turns, and pass rates (e.g., proportion of dialogues with any ). Statistical significance for turn-wise safety erosion is tested via Mann–Whitney U with Bonferroni correction between distributions , establishing rigorously the temporal degradation of safety under adversarial pressure ( for all consecutive turns) (Liu et al., 4 Jan 2026).
4. Empirical Findings: Model Vulnerabilities and Scaling Dynamics
4.1 Safety Degradation Across Turns
Safety degrades sharply across multi-turn dialogue:
- Turn 0 (initial): median
- Turn 1: median
- Turn 2: median
All pairwise turn comparisons show statistically significant decline (), quantifying the “erosion” of safety not observed with single-turn evaluation.
4.2 Model Category Effects
- Commercial LLMs (Claude Opus-4.1, Sonnet-4, GPT-5 series) sustain high safety ( median) with minimal drop over turns.
- Medical-specialized LLMs (MedGemma, Huatuo, II-Medical) exhibit rapid collapse (median by turn 3).
- Size scaling: Larger parameter variants within each model family reliably achieve higher safety scores.
4.3 Safety–Helpfulness Trade-Off
Helpfulness, measured by accuracy on the Japanese National Medical Licensing Exam, correlates with safety only among commercial models (clustered at pass and helpfulness). Medical-specialized models cluster in the lower-left, with both safety and medical reasoning degraded after domain fine-tuning. This suggests a trade-off where domain adaptation can erode pre-existing alignment constraints, reducing both safety and clinical performance (Liu et al., 4 Jan 2026).
4.4 Cross-Lingual Vulnerability
Cross-lingual evaluation on English translations of JMedEthicBench demonstrates that medical-model vulnerabilities persist, and often worsen, outside Japanese. For example, Huatuo-72B has a safety pass rate of 21.6% in Japanese and 11.9% in English. This points to alignment failures rooted in model architecture and training, not solely in surface linguistic expression.
5. Implications for Model Training, Red-Teaming, and Responsible AI
JMedEthicBench underscores two principal risk vectors in current medical LLMs: catastrophic forgetting of alignment safeguards after medical fine-tuning, and the unique vulnerability introduced by multi-turn adversarial pressure. Attack strategies that simulate cultural context, role-play, or authority override—undetectable in single-turn, checklist-style benchmarks—successfully coax unsafe responses in specialized models. Recommended mitigations include:
- Integrating red-teaming pipelines (e.g., AutoDAN-Turbo) as part of routine alignment and safety training.
- Co-optimizing parameters on both medical knowledge and multi-turn safety objectives, potentially through frameworks such as MTSA.
- Developing Japanese-specific safety adapters to preserve regulatory compliance during domain adaptation. Failure to address these gaps may systematically expose medical LLMs to incremental jailbreak techniques that can bypass hard-coded refusal or harm-avoidance mechanisms (Liu et al., 4 Jan 2026).
6. Dataset Release, Format, and Reproducibility
JMedEthicBench will be made available under an academic license, offering both Japanese and English corpora, sample generation code, and scoring scripts. The dataset structure is provided in JSONL format:
1 2 3 4 5 6 7 8 9 10 |
{
"id": ...,
"principle_id": ...,
"strategy_id": ...,
"dialogue": [
{"role": "user", "text": ...},
{"role": "assistant", "text": ...},
...
]
} |
A separate test set of 2,345 professionally validated dialogues is included. Reproduction involves running single-turn generation, strategy discovery, multi-turn simulation, scoring via the provided dual-LLM protocol, and aggregation of safety metrics, following the statistical methodologies detailed above (Liu et al., 4 Jan 2026).
7. Relationship to Other Medical Ethics Benchmarks
JMedEthicBench complements and extends the goals of global medical ethics benchmarks including MedEthicsQA (English, single-turn, taxonomy-grounded, operationalizing Beauchamp and Childress principles) (Wei et al., 28 Jun 2025), PrinciplismQA (OSCE-inspired, multi-principle, open-ended, and MCQ, with performance separable into Knowledge and Practice) (Hong et al., 7 Aug 2025), MedEthicEval (Chinese, knowledge/application split with cultural context) (Jin et al., 4 Mar 2025), and MedLaw/Triage (legal dilemma-based, context-perturbed, robustness-focused) (Sam et al., 2024). It uniquely addresses the intersection of cultural specificity, adversarial multi-turn dialogue, and concrete guideline grounding, revealing model vulnerabilities uniquely exposed by conversational escalation and localized ethical frameworks. The finding that state-of-the-art models exhibit safety attrition under adversarial pressure has broad implications for robust, trustworthy medical AI deployment internationally.