MedHELM Framework for Medical LLM Evaluation
- MedHELM is an extensible, clinician-validated framework that evaluates LLM performance across a comprehensive set of real-world medical tasks.
- It employs a clinician-derived taxonomy covering 5 categories and 121 tasks, ensuring that benchmarks align with practical clinical workflows.
- The framework integrates LLM-jury evaluation, structured summarization, and multimodal extensions to deliver transparent, cost-effective model comparisons.
The MedHELM framework is an extensible, clinician-validated platform for evaluating LLMs across a comprehensive array of medical tasks and workflows. Designed to address the limitations of traditional medical AI benchmarks—which often overemphasize licensing exam-style multiple-choice accuracy—MedHELM systematically measures LLM performance on real-world medical activities, spanning clinical decision support, note generation, patient communication, research assistance, and administrative processes. The framework integrates curated benchmarks, a domain-expert-derived taxonomy, and modern LLM-jury evaluation methods for transparent, reproducible, and cost-aware model comparison (Bedi et al., 26 May 2025).
1. Taxonomy and Scope of Medical Tasks
MedHELM centers on a clinician-validated taxonomy encompassing 5 principal categories, 22 subcategories, and 121 granular medical tasks. This taxonomy was developed and refined with input from 29 clinicians across 14 specialties, achieving a 96.7% inter-rater agreement in subcategory classification and a comprehensiveness rating of 4.21 out of 5. The five top-level categories are:
- Clinical Decision Support: Pattern recognition, diagnostics, treatment planning, risk/outcome prediction, and knowledge support.
- Clinical Note Generation: Generation and summarization of patient notes, procedural documentation, diagnostic and care plan documentation.
- Patient Communication & Education: Creation of educational content, patient instructions, messaging support, accessibility features, and engagement tracking.
- Medical Research Assistance: Literature review, trial result analysis, research process documentation, regulatory and quality compliance, and enrollment management.
- Administration & Workflow: Scheduling, financial task automation, workflow management, and care coordination (Bedi et al., 26 May 2025).
This taxonomy enables comprehensive coverage of real-world medical workflows, moving beyond synthetic or limited scenario-based benchmarks.
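To make the three-level hierarchy (categories → subcategories → tasks) concrete, the following minimal Python sketch shows one way such a taxonomy could be represented and counted. The category names come from the list above; the nested subcategory and task entries are placeholders, not the actual 22 subcategories and 121 tasks.

```python
# Illustrative three-level taxonomy: category -> subcategory -> [tasks].
# Category names follow MedHELM; the nested entries are placeholders, not the
# framework's actual 22 subcategories and 121 tasks.
taxonomy: dict[str, dict[str, list[str]]] = {
    "Clinical Decision Support": {
        "Diagnostic Support (placeholder)": ["Differential diagnosis generation"],
    },
    "Clinical Note Generation": {},
    "Patient Communication & Education": {},
    "Medical Research Assistance": {},
    "Administration & Workflow": {},
}

n_categories = len(taxonomy)
n_subcategories = sum(len(subs) for subs in taxonomy.values())
n_tasks = sum(len(tasks) for subs in taxonomy.values() for tasks in subs.values())
print(n_categories, n_subcategories, n_tasks)  # 5, 1, 1 for this sketch
```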
2. Benchmark Suite Design
The MedHELM benchmark suite maps directly onto the full taxonomy and comprises 35 benchmarks, including public datasets, reformulated existing benchmarks, and private or gated collections derived from electronic health records (EHR). Benchmarks are explicitly categorized by accessibility (public, gated, or private) and evaluation design (closed-ended vs. open-ended). Example benchmarks include MedQA (exam questions), NoteExtract (care plan restructuring), MedDialog (dialogue summarization), and MIMIC-IV Billing Code (billing-code prediction) (Bedi et al., 26 May 2025).
Each benchmark is tailored to the corresponding subcategory, ensuring alignment between tested model capabilities and actual clinical practice requirements.
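As an illustration of how benchmarks can be catalogued along these two axes (accessibility and evaluation design), the sketch below defines a minimal registry entry. The schema itself is an assumption rather than MedHELM's actual configuration format, and the access/subcategory assignments for the two example benchmarks are illustrative.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class BenchmarkSpec:
    """Minimal registry entry for a MedHELM-style benchmark (illustrative schema)."""
    name: str
    category: str                                   # one of the 5 top-level categories
    subcategory: str                                # one of the 22 subcategories
    access: Literal["public", "gated", "private"]   # data accessibility
    design: Literal["closed-ended", "open-ended"]   # evaluation design

REGISTRY = [
    BenchmarkSpec("MedQA", "Clinical Decision Support",
                  "Knowledge Support", "public", "closed-ended"),
    BenchmarkSpec("MIMIC-IV Billing Code", "Administration & Workflow",
                  "Financial Task Automation", "gated", "closed-ended"),
]

# Group by evaluation design to decide which scorer applies (exact match vs. LLM jury).
open_ended = [b.name for b in REGISTRY if b.design == "open-ended"]
```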
3. LLM-Jury Evaluation Methodology
MedHELM adopts an LLM-jury evaluation paradigm for open-ended and subjective benchmarks. Outputs are rated by an ensemble of three strong LLMs (e.g., GPT-4o, Claude 3.7 Sonnet, LLaMA 3.3 70B Instruct) on accuracy, completeness, and clarity using a 5-point Likert scale. For specific tasks, these criteria may be replaced with task-appropriate ones, such as adherence to a required output structure.
Inter-rater agreement between LLM-jurors and clinicians is measured using the intraclass correlation coefficient (ICC). Across two benchmark datasets, LLM-jury ICC reached 0.47, exceeding typical clinician–clinician agreement (0.43) and surpassing automated metrics such as ROUGE-L (0.36) and BERTScore-F1 (0.44). This approach enables cost-efficient, scalable evaluation that can approximate human expert review fidelity (Bedi et al., 26 May 2025).
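A minimal sketch of the jury aggregation step is shown below, assuming each of the three juror models returns a 1–5 Likert rating per dimension. The simple mean aggregation is an illustrative choice, and prompt construction and the ICC validation against clinicians are omitted.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "clarity")

def aggregate_jury(ratings: dict[str, dict[str, int]]) -> dict[str, float]:
    """Average 1-5 Likert ratings from multiple LLM jurors per dimension.

    `ratings` maps juror name -> {dimension -> score}; averaging is an
    illustrative aggregation rule, not necessarily MedHELM's exact one.
    """
    return {
        dim: mean(juror_scores[dim] for juror_scores in ratings.values())
        for dim in DIMENSIONS
    }

# Example: three jurors rating one model response to an open-ended task.
example = {
    "gpt-4o":            {"accuracy": 4, "completeness": 3, "clarity": 5},
    "claude-3.7-sonnet": {"accuracy": 4, "completeness": 4, "clarity": 5},
    "llama-3.3-70b":     {"accuracy": 5, "completeness": 3, "clarity": 4},
}
print(aggregate_jury(example))  # {'accuracy': 4.33, 'completeness': 3.33, 'clarity': 4.67}
```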
4. Integration of Structured Medical Summarization: RWESummary
RWESummary is a specialized benchmark integrated into MedHELM for evaluating LLMs on summarizing structured Real-World Evidence (RWE) studies (Mukerji et al., 23 Jun 2025). Unlike traditional summarization tasks based on free-text clinical manuscripts, RWESummary uses JSON-formatted RWE study outputs with data fields for clinical questions, PICOT study design, baseline covariates, sample sizes, effect estimates (e.g., odds ratios, mean differences, confidence intervals), and significance levels.
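The benchmark's exact schema is not reproduced here, but the hypothetical record below illustrates the kind of structured study input described above, parsed in Python; all field names are assumptions for illustration only.

```python
import json

# Hypothetical RWE study record; field names are illustrative stand-ins for the
# described inputs (clinical question, PICOT design, covariates, sample sizes,
# effect estimates, significance), not the benchmark's actual schema.
rwe_study = json.loads("""
{
  "clinical_question": "Does drug A reduce 1-year hospitalization vs. drug B?",
  "picot": {"population": "Adults with heart failure", "intervention": "Drug A",
            "comparator": "Drug B", "outcome": "Hospitalization", "time": "12 months"},
  "baseline_covariates": ["age", "sex", "ejection fraction"],
  "sample_sizes": {"intervention": 4210, "comparator": 4198},
  "effect_estimates": [
    {"outcome": "Hospitalization", "measure": "odds_ratio", "value": 0.82,
     "ci_95": [0.71, 0.95], "p_value": 0.008}
  ]
}
""")

# A summarizer is prompted with such a record; its textual summary can then be
# checked for direction of effect, numeric accuracy, and completeness.
significant = [e for e in rwe_study["effect_estimates"] if e["p_value"] <= 0.05]
```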
RWESummary Evaluation Metrics
Three primary error-type metrics are defined:
- Direction of Effect: Correct assignment of effect sign (positive, negative, no difference).
- Numeric Accuracy: Fidelity in reporting numerical values (ORs, CIs, p-values, sample counts).
- Completeness: Inclusion of all statistically significant outcomes (p ≤ 0.05) in generated summaries.
A secondary metric captures inference time per summary, supporting cost-performance analysis. Binary scoring on each metric is conducted by a jury of LLMs. Final scoring employs user-tunable weighted sums of the normalized metrics, optionally incorporating inference time (lower is better), as sketched below (Mukerji et al., 23 Jun 2025).
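The following is a minimal sketch of such a scoring rule, assuming the binary per-item judgments have already been averaged into normalized metric scores in [0, 1] and that inference time is mapped so faster models score higher. The weights, time cap, and example values (taken from the results reported below) are illustrative, not the benchmark's defaults.

```python
def rwesummary_score(
    metrics: dict[str, float],          # normalized metric scores in [0, 1]
    weights: dict[str, float],          # user-tunable weights (same keys)
    inference_time_s: float | None = None,
    max_time_s: float = 60.0,           # illustrative cap for normalizing latency
    time_weight: float = 0.0,
) -> float:
    """Weighted sum of normalized metrics, optionally rewarding fast inference."""
    total = sum(weights[k] * metrics[k] for k in metrics)
    weight_sum = sum(weights[k] for k in metrics)
    if inference_time_s is not None and time_weight > 0:
        # Lower latency is better: map time onto [0, 1] with 1.0 at zero latency.
        time_score = max(0.0, 1.0 - inference_time_s / max_time_s)
        total += time_weight * time_score
        weight_sum += time_weight
    return total / weight_sum

score = rwesummary_score(
    metrics={"direction_of_effect": 0.949, "numeric_accuracy": 0.974, "completeness": 1.0},
    weights={"direction_of_effect": 1.0, "numeric_accuracy": 1.0, "completeness": 1.0},
    inference_time_s=14.3, time_weight=0.5,
)
```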
5. Experimental Results and Performance Analysis
MedHELM's empirical evaluation covered nine advanced LLMs across the benchmark suite. Reported metrics include pairwise head-to-head win-rates and macro-averaged normalized scores on a [0, 1] scale. DeepSeek R1 and o3-mini demonstrated the leading win-rates (0.66 and 0.64, respectively), while Claude 3.5 Sonnet performed comparably at significantly lower computational cost.
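For reference, a head-to-head win-rate of this kind can be computed as the fraction of (opponent, benchmark) comparisons a model wins; the sketch below is a generic illustration of that calculation, not MedHELM's actual aggregation code, and the tie-handling convention is an assumption.

```python
def pairwise_win_rate(scores: dict[str, dict[str, float]], model: str) -> float:
    """Fraction of (opponent, benchmark) comparisons that `model` wins.

    `scores` maps model -> {benchmark -> normalized score}; ties count as half
    a win (an illustrative convention, not necessarily MedHELM's).
    """
    wins, comparisons = 0.0, 0
    for opponent, opp_scores in scores.items():
        if opponent == model:
            continue
        for bench, own in scores[model].items():
            if bench not in opp_scores:
                continue
            comparisons += 1
            if own > opp_scores[bench]:
                wins += 1
            elif own == opp_scores[bench]:
                wins += 0.5
    return wins / comparisons if comparisons else 0.0
```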
RWESummary results showed that the Gemini 2.5 Pro and Flash models outperformed other models on direction of effect (up to 0.949), numeric accuracy (up to 0.974), and completeness (perfect for Gemini 2.5 Flash), while maintaining low inference times (14.2–14.4 s). Weighted rubric application yielded normalized total scores of 100% for Gemini 2.5 Pro, 99% for Gemini 2.5 Flash, and 94% for the next-best model (Gemini 2.0 Flash) (Mukerji et al., 23 Jun 2025).
Category-level MedHELM scores indicated that free-text generation tasks are handled more robustly (scores 0.78–0.85) than structured reasoning or coding tasks (0.53–0.72) (Bedi et al., 26 May 2025).
6. Framework Architecture and Multimodal Extensions
MedHELM builds on HELM (Holistic Evaluation of Language Models), inheriting its architecture-agnostic benchmark infrastructure, and is directly extensible to multimodal LLMs. The name has also been used interchangeably with HeLM ("Health LLM for Multimodal Understanding") in earlier work, notably for evaluating LLMs enriched with high-dimensional clinical signals (e.g., spirograms) alongside standard tabular features (Belyaeva et al., 2023).
Non-text modalities are mapped into the LLM’s embedding space via dedicated encoders (e.g., ResNet1D for spirograms, MLPs for tabular data), with end-to-end inference allowing all modalities—text, tabular, time series—to interact through self-attention mechanisms in a frozen transformer. Supervised objectives optimize encoder parameters, leaving the core LLM weights unchanged.
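The PyTorch sketch below illustrates the general pattern: a small trainable encoder projects a non-text signal (here a spirogram-like 1D series) into the token-embedding space of a frozen transformer, so the projected "soft token" attends jointly with ordinary text tokens. The layer sizes and the toy transformer are stand-ins for the actual ResNet1D encoder and LLM backbone, and this is a sketch of the idea rather than the published implementation.

```python
import torch
import torch.nn as nn

class SpirogramEncoder(nn.Module):
    """Toy 1D-conv encoder projecting a spirogram series into the LLM embedding space."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                       # pool over time
        )
        self.proj = nn.Linear(16, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, length)
        h = self.conv(x.unsqueeze(1)).squeeze(-1)           # (batch, 16)
        return self.proj(h).unsqueeze(1)                    # one "soft token": (batch, 1, embed_dim)

embed_dim = 64
encoder = SpirogramEncoder(embed_dim)                       # trainable
backbone = nn.TransformerEncoder(                           # stands in for a frozen LLM
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True), num_layers=2
)
for p in backbone.parameters():                             # freeze the "LLM"; only the encoder learns
    p.requires_grad = False

text_embeds = torch.randn(2, 10, embed_dim)                 # placeholder text-token embeddings
spiro_token = encoder(torch.randn(2, 300))                  # 300-sample spirogram per example
fused = backbone(torch.cat([spiro_token, text_embeds], dim=1))  # joint self-attention
```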
This multimodal approach (HeLM) has been shown to outperform or match classical methods (e.g., logistic regression, XGBoost) on various risk-estimation tasks (e.g., AUROC 0.75 for asthma with tabular+spirogram inputs vs. 0.68 for logistic regression), and it demonstrates plausible generalization to out-of-distribution traits. An observed limitation is that multimodal encoder tuning can degrade free-text conversational quality, highlighting the need for future alignment strategies (Belyaeva et al., 2023).
7. Limitations, Extensions, and Implications
While MedHELM's LLM jury demonstrates strong agreement with clinical raters, validation has thus far been performed on only a subset of open-ended tasks. Expanded benchmark coverage and increased instance-level granularity in rubrics are areas for further research, particularly for subjective or ambiguous clinical scenarios. Administrative and workflow tasks remain the lowest-performing area for all tested models, suggesting either model limitations or benchmark design gaps.
The MedHELM framework provides transparent, reproducible evaluation critical for safe and cost-effective LLM deployment in healthcare environments. The integration of structured-scenario tests like RWESummary supports routine and scalable benchmarking of emerging LLMs for RWE tasks within clinical pipelines. A plausible implication is that, as more real-world studies are ingested, MedHELM’s analytical power and clinical relevance will continue to deepen, supporting better model selection and risk mitigation for medical AI deployments (Mukerji et al., 23 Jun 2025, Bedi et al., 26 May 2025, Belyaeva et al., 2023).