
PEDIASBench: Pediatric AI Benchmark

Updated 24 November 2025
  • PEDIASBench is a framework that benchmarks pediatric competency in LLMs through a tripartite architecture of foundational knowledge, dynamic reasoning, and ethical safety.
  • It covers 19 subspecialties and 211 diseases using clinical vignettes, multi-modal tasks, and test items that simulate real pediatric care.
  • The evaluation employs rigorous metrics for accuracy, reasoning, and ethics to identify performance gaps and guide future AI development.

PEDIASBench is a systematic evaluation framework designed to rigorously assess the pediatric competency of LLMs in realistic clinical scenarios. It implements a knowledge-system architecture comprising foundational knowledge, dynamic reasoning, and pediatric ethics, capturing the core competencies required for safe and effective pediatric care. PEDIASBench benchmarks LLMs over 19 pediatric subspecialties and 211 prototypical diseases, provides multi-modal and multi-dimensional test items, and reports detailed performance analyses and error profiles, thereby identifying critical limitations and guiding future development in pediatric AI (Zhu et al., 17 Nov 2025).

1. Knowledge-System Architecture

PEDIASBench's design is grounded in a three-layer knowledge-system framework closely aligned with real-world pediatric clinical practice:

  • Foundational Knowledge Layer: Synthesizes static knowledge from pediatric textbooks, clinical protocols, and content mandated by licensing examinations. This layer assesses factual recall, application of guidelines, weight-based dose calculation, and knowledge of developmental norms.
  • Dynamic Reasoning Layer: Tests the ability to update clinical hypotheses, integrate new findings, and iteratively refine diagnostic and therapeutic strategies over time. This layer is operationalized through multi-stage patient vignettes, simulating changing presentations and requiring time-dependent adaptation of clinical reasoning.
  • Ethical-Safety Layer: Encompasses principles and actions underlying safe, humanistic care—including beneficence, non-maleficence, justice, informed consent, professionalism, patient-centered communication, and systemic safety management.

This tripartite architecture permits granular evaluation of both knowledge depth and practical capability, modeling the multifaceted demands placed on practicing pediatricians (Zhu et al., 17 Nov 2025).
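Purely as an illustration of this tripartite structure, the three layers might be represented as tags attached to each test item; the enum and competency lists below are a hypothetical sketch, not an artifact of the benchmark itself.

```python
from enum import Enum

class Layer(Enum):
    """Hypothetical tags mirroring PEDIASBench's three-layer architecture."""
    FOUNDATIONAL_KNOWLEDGE = "foundational_knowledge"  # static facts, guidelines, dosing
    DYNAMIC_REASONING = "dynamic_reasoning"            # multi-stage vignette reasoning
    ETHICAL_SAFETY = "ethical_safety"                  # ethics, consent, communication

# Illustrative mapping of layers to the competencies described above.
COMPETENCIES = {
    Layer.FOUNDATIONAL_KNOWLEDGE: ["factual recall", "guideline application",
                                   "weight-based dosing", "developmental norms"],
    Layer.DYNAMIC_REASONING: ["hypothesis updating", "integrating new findings",
                              "time-dependent management refinement"],
    Layer.ETHICAL_SAFETY: ["beneficence", "non-maleficence", "justice",
                           "informed consent", "patient-centered communication"],
}
```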

2. Task Construction and Scope

PEDIASBench spans a representative array of clinically relevant material:

  • Coverage: 19 pediatric subspecialties (e.g., neonatology, cardiology, endocrinology, pediatric surgery) and 211 prototypical diseases, providing broad sampling of both internal medicine and surgical cases.
  • Task Types:
    • Application of Basic Knowledge: Single-choice and multiple-choice items derived from multi-tiered pediatric licensing examinations (Resident, Junior, Intermediate, Senior).
    • Dynamic Diagnosis and Treatment Capability: Interactive, longitudinal vignettes presented across two temporal nodes: T1 (initial presentation and management) and T2 (new clinical data prompting diagnostic refinement and management updates).
    • Pediatric Medical Safety and Ethics: Single- and multiple-choice scenarios encompassing 10 key subdomains, such as clinical ethics, informed consent, communication, and quality/safety management.

The construction of task items is explicitly aligned with authentic clinical workflows, promoting ecological validity of evaluation (Zhu et al., 17 Nov 2025).
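For concreteness, a two-stage dynamic vignette of the kind described above could be encoded roughly as follows; the field names and the clinical content are invented for illustration and are not drawn from the released benchmark items.

```python
from dataclasses import dataclass, field

@dataclass
class VignetteStage:
    """One temporal node (T1 or T2) of a dynamic diagnosis-and-treatment item."""
    label: str                      # "T1" or "T2"
    presentation: str               # clinical findings revealed at this stage
    expected_key_points: list[str]  # reference points used to grade free-text answers

@dataclass
class DynamicCase:
    subspecialty: str
    disease: str
    stages: list[VignetteStage] = field(default_factory=list)

# Hypothetical example item (content invented for illustration only).
case = DynamicCase(
    subspecialty="neonatology",
    disease="neonatal sepsis",
    stages=[
        VignetteStage("T1", "3-day-old with fever and poor feeding",
                      ["obtain blood cultures", "start empiric antibiotics"]),
        VignetteStage("T2", "blood culture grows group B Streptococcus",
                      ["narrow antibiotic coverage", "reassess treatment duration"]),
    ],
)
```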

3. Evaluation Metrics

PEDIASBench employs a set of rigorous quantitative metrics structured to probe both discrete response accuracy and open-ended reasoning:

  • Accuracy for Single-Choice Questions:

\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}

  • Multiple-Choice Metrics:

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \quad\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}

F_1 = 2\,\frac{\text{Precision}\times \text{Recall}}{\text{Precision} + \text{Recall}}

  • Composite Score for Open-Ended Responses:

\text{Score}_\text{open} = 0.7 \times \text{MacroRecall} + 0.3 \times \text{BERTScore}

where

\text{MacroRecall} = \frac{1}{M}\sum_{i=1}^M \frac{\text{CoveredKeyPoints}_i}{\text{TotalKeyPoints}_i}

  • Dynamic Case Reasoning:

\overline{S}_\mathrm{reasoning} = \frac{1}{N}\sum_{i=1}^N s_i

with s_i scored 0–1 per case (completeness and adaptability).

  • Pediatric Medical Ethics & Safety:

\text{EthicsAccuracy} = \frac{\#\,\text{CorrectEthicsResponses}}{\#\,\text{TotalEthicsItems}}

These metrics collectively enable precise dissection of factual mastery, integrative reasoning capability, and conformity to pediatric ethics and safety standards (Zhu et al., 17 Nov 2025).
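The following is a minimal sketch of how these metrics could be computed for one model's outputs, assuming multiple-choice answers are represented as option sets and open-ended grading supplies per-item key-point counts plus a BERTScore value; the function names and numbers are illustrative, not taken from the PEDIASBench implementation.

```python
def multi_choice_metrics(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 for one multiple-choice item (options as sets)."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def macro_recall(covered: list[int], total: list[int]) -> float:
    """MacroRecall: mean fraction of reference key points covered across M items."""
    return sum(c / t for c, t in zip(covered, total)) / len(total)

def open_ended_score(macro_recall_value: float, bert_score: float) -> float:
    """Composite open-ended score: 0.7 * MacroRecall + 0.3 * BERTScore."""
    return 0.7 * macro_recall_value + 0.3 * bert_score

# Illustrative usage with invented numbers.
print(multi_choice_metrics({"A", "C"}, {"A", "B", "C"}))          # partial credit
print(open_ended_score(macro_recall([3, 2], [4, 5]), bert_score=0.86))
```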

4. Model Performance Overview

A comparative analysis of 12 recent LLMs on PEDIASBench highlights both their strengths and persistent shortcomings:

| Dimension | Top Model(s) | Peak Performance | Drop-off / Notes |
| --- | --- | --- | --- |
| Basic knowledge (single-choice) | Qwen3-235B-A22B | 91.8% (Resident) | −15% at Senior level; declines sharply with difficulty |
| Basic knowledge (multiple-choice, F₁) | Llama-4-Maverick, Gemini-2.5 | 0.976, 0.968 (Resident) | 15–20% accuracy drop at Senior level |
| Dynamic reasoning (mean reasoning score) | DeepSeek-R1 | 0.58 (best) | Internal medicine: 0.62 (DeepSeek-R1); surgery: 0.54 (GPT-4o) |
| Ethics & safety | Qwen2.5-72B | 92.05% | Most models ≥90% on communication; lowest 75% |

Foundational knowledge retention is robust for state-of-the-art models, with over 90% accuracy on residency-level licensing content. Performance declines consistently as item complexity and the demand for integrative or longitudinal reasoning increase, with accuracy and F₁ each dropping by roughly 15–20%. In dynamic case simulations, no model approaches expert-level performance or demonstrates robust adaptive decision-making. Ethics and safety compliance are high (>90%), but free-text, empathic, or developmentally sensitive output remains formulaic and lacks genuine warmth (Zhu et al., 17 Nov 2025).

5. Error Analysis and Identified Weaknesses

Analysis of LLM failure modes on PEDIASBench highlights multiple persistent limitations:

  • Integrative Reasoning Deficits: Marked reductions in F₁ and recall when items require integrating knowledge across specialties (e.g., cardiology–genetics questions).
  • Dynamic Decision-Making Shortfall: Mean reasoning scores below 0.6 in scenarios requiring rapid recalibration in response to new clinical data.
  • Humanistic Care Deficiency: Ethics and professionalism scores are high on structured items, but models produce formulaic, non-nuanced responses in open-text empathy and communication tasks. Child-friendly language and developmentally sensitive communication are notably absent.
  • Systematic Measurement Approaches: Error analysis includes key-point omissions in open-ended responses, confusion matrices illuminating false-positive/false-negative asymmetries in complex domains (e.g., oncology; see the tallying sketch below), and qualitative reviews of dialogue fluency and appropriateness (Zhu et al., 17 Nov 2025).

These findings signal that substantial competency gaps remain in the application of practical, integrative clinical knowledge and the dynamic, empathic interaction required in pediatrics.
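As an example of the tallying behind such false-positive/false-negative asymmetry analyses, a per-domain count over multiple-choice predictions might look like the sketch below; the data and function are invented for illustration and do not reproduce the authors' analysis code.

```python
from collections import defaultdict

def fp_fn_by_domain(items):
    """Tally false positives (extra options chosen) and false negatives (missed options)
    per clinical domain, to expose asymmetric error patterns (e.g., in oncology items)."""
    tally = defaultdict(lambda: {"fp": 0, "fn": 0})
    for domain, predicted, gold in items:
        tally[domain]["fp"] += len(predicted - gold)
        tally[domain]["fn"] += len(gold - predicted)
    return dict(tally)

# Invented example: the model over-selects in oncology and under-selects in cardiology.
items = [
    ("oncology", {"A", "B", "D"}, {"A", "B"}),
    ("cardiology", {"A"}, {"A", "C"}),
]
print(fp_fn_by_domain(items))  # {'oncology': {'fp': 1, 'fn': 0}, 'cardiology': {'fp': 0, 'fn': 1}}
```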

6. Recommendations and Future Trajectories

The PEDIASBench authors propose concrete strategies to address LLM shortcomings uncovered by their framework:

  • Multimodal Integration: Incorporating image, vital sign, and lab data through vision-language or sensor-augmented models to enrich data inputs available to LLMs.
  • Clinical Feedback Loops: Real-time pediatric clinician review and correction to iteratively refine model parameters, promoting adaptation to authentic clinical workflows.
  • Enhanced Interpretability: Retrieval-augmented generation with automated source citation (e.g., AAP guidelines) to improve transparency, trust, and auditability; a minimal sketch of this pattern follows the list.
  • Human–AI Collaboration: Tools for interactive clinician-guided model reasoning, enabling query of “why” and “what if” scenarios, and supporting co-authorship of care plans.
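To make the "Enhanced Interpretability" recommendation concrete, the sketch below shows one generic retrieval-augmented, citation-aware prompting pattern; the toy retriever, guideline snippets, and prompt wording are assumptions for illustration and do not represent any specific system or actual guideline text.

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Toy keyword retriever over a citation-keyed corpus (placeholder for a real index)."""
    scored = sorted(corpus.items(),
                    key=lambda kv: -sum(w in kv[1].lower() for w in query.lower().split()))
    return scored[:k]

def build_cited_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Prepend retrieved passages and instruct the model to cite them inline by source ID."""
    context = "\n".join(f"[{src}] {text}" for src, text in passages)
    return (f"Use only the sources below and cite each claim as [source ID].\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")

# Hypothetical guideline snippets keyed by a citable source ID (content invented).
corpus = {
    "AAP-fever": "Young febrile infants warrant urine, blood, and CSF evaluation...",
    "AAP-jaundice": "Phototherapy thresholds depend on gestational age and risk factors...",
}
prompt = build_cited_prompt("How should a 3-week-old with fever be evaluated?",
                            retrieve("fever infant evaluation", corpus))
# `prompt` would then be passed to the LLM of choice (model call omitted here).
```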

PEDIASBench thus provides a competency-based, ethically anchored foundation for next-generation pediatric AI. While current LLMs demonstrate promise as adjuncts for decision support, medical education, and patient communication, their present limitations unequivocally prohibit unsupervised clinical deployment. The pathway to fully trusted pediatric AI necessitates advances in multimodal integration, feedback-driven learning, robust interpretability, and enhanced human-AI interaction (Zhu et al., 17 Nov 2025).
