MedBench v4: Medical AI Benchmarking

Updated 25 November 2025
  • MedBench v4 is a nationwide, cloud-based platform that benchmarks Chinese medical AI models, multimodal systems, and intelligent agents using over 700,000 expert-curated clinical tasks.
  • It employs a calibrated LLM-as-a-judge pipeline alongside dual human scoring to deliver precise metrics on safety, ethics, and cross-modal reasoning.
  • The infrastructure supports both API and answer-upload modes, ensuring secure, scalable evaluations for hospitals, developers, and regulatory bodies.

MedBench v4 is a nationwide, cloud-based benchmarking infrastructure designed for the evaluation of Chinese medical LLMs, multimodal models, and intelligent agents. Developed under China’s national AI healthcare pilot base in close collaboration with over 500 medical institutions, it encompasses more than 700,000 expert-curated tasks aligned with real clinical workflows and regulatory standards. Three dedicated evaluation tracks—LLM, multimodal, and agent—capture the diverse spectrum of medical AI capabilities, and all outputs are scored via a calibrated LLM-as-a-judge pipeline. MedBench v4 provides a robust, broad-coverage standard for hospitals, developers, and policymakers auditing medical AI for clinical and safety compliance (Ding et al., 18 Nov 2025).

1. Benchmark Infrastructure and Scope

MedBench v4 operates as a secure, cloud-hosted platform fulfilling nationwide benchmarking requirements. It supports two submission modes: (1) API mode, with encrypted, randomized test items pushed directly to registered endpoints; and (2) an answer-upload mode, where clients infer on stratified splits and upload only predictions for centralized scoring. All ground-truth annotations remain server-side, with strict version-locking and rate-limits to ensure auditability and prevent overfitting.
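
To make the submission flow concrete, below is a minimal sketch of the answer-upload mode. The field names, file layout, and helper function are hypothetical illustrations, not the platform's documented API; the point is that only predictions keyed by server-issued item IDs leave the client, while ground truth and scoring remain server-side.

```python
import json
from pathlib import Path

def build_submission(predictions: dict[str, str], model_name: str, track: str) -> dict:
    """Package model outputs for centralized scoring (hypothetical schema).

    `predictions` maps server-issued item IDs to model answers; no ground
    truth is ever present on the client side.
    """
    return {
        "model": model_name,
        "track": track,                 # e.g. "llm", "multimodal", "agent"
        "benchmark_version": "v4",      # version-locked on the server
        "predictions": [
            {"item_id": item_id, "answer": answer}
            for item_id, answer in predictions.items()
        ],
    }

# Example: write an answer-upload file for a stratified split.
preds = {"item-000123": "Metformin should be withheld before contrast imaging because ..."}
payload = build_submission(preds, model_name="my-medical-llm", track="llm")
Path("submission_llm_v4.json").write_text(json.dumps(payload, ensure_ascii=False, indent=2))
```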

The task suite—over 700,000 items drawn from 36 datasets—spans 24 primary and 91 secondary clinical specialties. Tasks are mapped to four application domains defined by the National Health Commission AI framework: documentation structuring, clinical decision support, quality control, and operational management. Evaluation is partitioned into three tracks:

  • LLM track: 36 text-only test sets, assessing five capabilities: Medical Language Understanding (MLU), Medical Language Generation (MLG), Medical Knowledge QA (MKQA), Complex Medical Reasoning (CMR), and Healthcare Safety & Ethics (HSE).
  • Multimodal track: 10 datasets spanning visual perception (detection, classification, OCR), cross-modal reasoning (VQA, report generation, QC), and image-conditioned decision support (diagnostics, therapy).
  • Agent track: 14 datasets probing goal decomposition, tool/API invocation, intent recognition, long-context processing, multi-agent collaboration, and safety/adversarial robustness, with orchestration modules layered on existing LLM backbones.

2. Task Curation and Review Protocols

Task curation involves multi-stage, clinician-centered refinement. More than 500 institutional partners (hospitals, societies, and academic centers) contribute de-identified clinical cases and scenarios. Items are processed through initial drafting, terminology normalization, de-identification, machine-readable formatting, dual clinician review, senior expert adjudication, and final quality assurance.
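
For concreteness, the review protocol can be modeled as an ordered sequence of stages attached to each item record. The stage names mirror the pipeline described above, but the data structure itself is a hypothetical illustration, not MedBench's internal schema.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class ReviewStage(Enum):
    DRAFTED = auto()
    TERMINOLOGY_NORMALIZED = auto()
    DEIDENTIFIED = auto()
    MACHINE_READABLE_FORMATTED = auto()
    DUAL_CLINICIAN_REVIEWED = auto()
    SENIOR_EXPERT_ADJUDICATED = auto()
    QA_PASSED = auto()

@dataclass
class BenchmarkItem:
    item_id: str
    specialty: str                  # one of the 24 primary / 91 secondary specialties
    domain: str                     # NHC application domain, e.g. "clinical decision support"
    completed_stages: list[ReviewStage] = field(default_factory=list)

    def advance(self, stage: ReviewStage) -> None:
        """Record completion of the next curation stage, enforcing the fixed order."""
        if self.release_ready:
            raise ValueError("all curation stages already completed")
        expected = list(ReviewStage)[len(self.completed_stages)]
        if stage is not expected:
            raise ValueError(f"expected {expected.name}, got {stage.name}")
        self.completed_stages.append(stage)

    @property
    def release_ready(self) -> bool:
        return len(self.completed_stages) == len(ReviewStage)
```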

Open-ended tasks employ Qwen2.5-72B-Instruct as a meta-prompted LLM-as-a-judge, calibrated along four anchor dimensions (correctness, professionalism, compliance/safety, usability; each rated 0–5). Approximately 20% of items are doubly scored by a pool of over 1,000 licensed physicians, enabling human-model calibration via both raw agreement and Cohen’s kappa:

$$p_o = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[s_j(i) = s_h(i)]$$

$$p_e = \sum_{k} p_{j,k}\,p_{h,k}$$

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $s_j(i)$ and $s_h(i)$ are the judge and human scores on item $i$, and $p_{j,k}$, $p_{h,k}$ are the marginal frequencies with which the judge and the human assign score $k$.

Observed Cohen’s $\kappa$ exceeds 0.82 on key tasks, confirming high concordance between human and LLM adjudication. Regular human audits correct for potential systematic biases.
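
As a minimal sketch of this calibration check, the following computes raw agreement and Cohen's $\kappa$ from paired judge/human ratings on the 0–5 anchor scale; the score arrays are hypothetical stand-ins for the doubly scored subset.

```python
from collections import Counter

def cohens_kappa(judge_scores: list[int], human_scores: list[int]) -> tuple[float, float]:
    """Return (raw agreement p_o, Cohen's kappa) for paired discrete ratings."""
    assert len(judge_scores) == len(human_scores) and judge_scores
    n = len(judge_scores)

    # p_o: fraction of items where judge and human assign the same anchor score.
    p_o = sum(j == h for j, h in zip(judge_scores, human_scores)) / n

    # p_e: chance agreement from the two raters' marginal score distributions.
    judge_marginals = Counter(judge_scores)
    human_marginals = Counter(human_scores)
    p_e = sum(
        (judge_marginals[k] / n) * (human_marginals[k] / n)
        for k in range(6)  # anchor scale 0-5
    )
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

# Hypothetical doubly scored items (0-5 anchor ratings).
judge = [5, 4, 4, 3, 5, 2, 4, 5]
human = [5, 4, 3, 3, 5, 2, 4, 4]
p_o, kappa = cohens_kappa(judge, human)
print(f"p_o = {p_o:.3f}, kappa = {kappa:.3f}")
```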

3. Metrics and Scoring Framework

All raw task-level scores are linearly rescaled to $[0, 100]$ and then macro-averaged. For model $m$, let $T$ be the set of all tasks and $D$ the number of capability dimensions; $T_d$ denotes the tasks within dimension $d$, and $s_{m,t}$ the rescaled score of model $m$ on task $t$.

Overall score:

$$\text{Overall}_m = \frac{1}{D}\sum_{d=1}^{D} \left[\frac{1}{|T_d|}\sum_{t\in T_d} s_{m,t}\right]$$

Key subset scores include:

  • Safety (HSE for LLMs, or dedicated sets for agents):

$$\text{Safety}_m = \frac{1}{|T_{\text{safety}}|}\sum_{t\in T_{\text{safety}}} s_{m,t}$$

  • Ethics (subset of HSE):

$$\text{Ethics}_m = \frac{1}{|T_{\text{ethics}}|}\sum_{t\in T_{\text{ethics}}} s_{m,t}$$

  • Cross-modal reasoning (multimodal track):

$$\text{CrossModal}_m = \frac{1}{|T_{\text{cross}}|}\sum_{t\in T_{\text{cross}}} s_{m,t}$$

Task-level metrics include exact match, micro-F1, macro key-point recall, intersection over union (IoU), 1-NED (one minus normalized edit distance), and macro-recall, as appropriate.
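
A minimal sketch of the macro-averaging defined above, assuming a flat mapping from tasks to capability dimensions and already-rescaled $[0,100]$ task scores (the task names and values below are hypothetical):

```python
from statistics import mean

def overall_score(task_scores: dict[str, float], task_dims: dict[str, str]) -> float:
    """Macro-average: mean over tasks within each dimension, then mean over dimensions."""
    dims: dict[str, list[float]] = {}
    for task, score in task_scores.items():
        dims.setdefault(task_dims[task], []).append(score)
    return mean(mean(scores) for scores in dims.values())

def subset_score(task_scores: dict[str, float], subset: set[str]) -> float:
    """Plain mean over a designated task subset (e.g. the safety or ethics sets)."""
    return mean(task_scores[t] for t in subset)

# Hypothetical rescaled scores for one model.
scores = {"mlu_qa": 61.0, "mlg_summaries": 58.5, "mkqa_exams": 66.0, "hse_refusals": 21.0}
dims = {"mlu_qa": "MLU", "mlg_summaries": "MLG", "mkqa_exams": "MKQA", "hse_refusals": "HSE"}
print(round(overall_score(scores, dims), 1))             # dimension-level macro-average
print(round(subset_score(scores, {"hse_refusals"}), 1))  # safety subset
```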

4. Experimental Results and Performance Landscape

MedBench v4 evaluates LLMs, multimodal models, and agentic orchestration using 15 state-of-the-art architectures. Core results are summarized as follows:

| Track | Mean Score | Best Model | Best Score |
|---|---|---|---|
| LLM | 54.1 | Claude Sonnet 4.5 | 62.5 |
| Multimodal | 47.5 | GPT-5 | 54.9 |
| Agent | 79.8 | Claude-based agent | 85.3 |

  • Base LLMs (mean overall 54.1/100): Safety & ethics scores are low (mean 18.4/100), with underperformance linked to factual hallucinations, lack of embedded safeguards, and limited clinical risk representation in general-purpose chat interfaces.
  • Multimodal models (mean overall 47.5/100): High performance in visual perception (often >70/100), but cross-modal reasoning and clinical integration lag (30–40/100), underscoring the difficulty of harmonizing textual and image modalities.
  • Agent architectures (mean overall 79.8/100): Substantial improvement, especially in safety & ethics (mean 73.4/100, best agent 88.9/100), attributable to systematic planning, tool governance, safety modules, and memory augmentation.

The mean performance uplift from agentic orchestration ($\overline{\Delta}$) is approximately 25.7 points (relative uplift ~47.6%), i.e.,

$$\Delta_m = S^{\mathrm{agent}}_m - S^{\mathrm{base}}_m,\quad \text{Uplift}\% = \frac{S^{\mathrm{agent}}_m - S^{\mathrm{base}}_m}{S^{\mathrm{base}}_m} \times 100\%$$
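
The uplift computation itself is simple arithmetic; the sketch below applies it to hypothetical per-model base and agent overall scores (the model names and values are placeholders, not the reported per-model results):

```python
from statistics import mean

def uplift(base: float, agent: float) -> tuple[float, float]:
    """Absolute delta in points and relative uplift in percent."""
    delta = agent - base
    return delta, 100.0 * delta / base

# Hypothetical (base, agent) overall scores for a few models.
pairs = {"model_a": (54.0, 80.1), "model_b": (58.2, 82.0), "model_c": (50.3, 77.4)}
deltas = [uplift(b, a)[0] for b, a in pairs.values()]
uplifts = [uplift(b, a)[1] for b, a in pairs.values()]
print(f"mean delta = {mean(deltas):.1f} points, mean relative uplift = {mean(uplifts):.1f}%")
```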

A plausible implication is that structured agentic pipelines can significantly boost both technical and safety dimensions of clinical AI readiness while preserving or enhancing core task-solving capacity.

5. Clinical and Regulatory Alignment

All test item content and categories are mapped directly to Chinese clinical guidelines and the National Health Commission’s mandated AI application scenarios (e.g., EMR structuring adheres to Chinese ICD standards, drug auditing tasks are based on national safety protocols). The benchmark is developed in concert with hospitals, medical societies, and regulators, ensuring scenario fidelity and policy relevance.

  • Hospitals gain an authoritative external reference for AI procurement, validation, and routine quality assurance.
  • AI developers access a unified, comprehensive gold-standard reflecting contemporary Chinese clinical and regulatory practice.
  • Regulatory stakeholders can leverage MedBench v4 for audit frameworks, ongoing compliance monitoring, and oversight of emerging medical AI deployments.

6. Significance and Outlook

MedBench v4 establishes a scalable, audit-ready infrastructure for evaluating Chinese medical LLMs, multimodal models, and agentic systems, with strong alignment to real-world clinical workflows and regulatory priorities. It highlights persistent deficiencies in base and multimodal models, especially for safety/ethics and cross-modal reasoning, and empirically demonstrates that governance-aware agent orchestration can markedly enhance both performance and clinical acceptability without sacrificing breadth of capability (Ding et al., 18 Nov 2025).
