Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Published 15 May 2026 in cs.AI and cs.CL | (2605.16215v1)

Abstract: Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Summary

  • The paper introduces the first fully open pipeline for clinical LLMs that ensures complete auditability of data, processes, and generation.
  • It integrates clinician-audited public datasets with synthetic and guideline-derived QA to enhance coverage of emergency and life-threatening cases.
  • Empirical results demonstrate state-of-the-art accuracy and clinical preference improvements, underpinning reproducible and safe medical AI deployment.

Motivation and Conceptual Foundations

Clinical deployment of LLMs imposes requirements that conventional open-weight approaches cannot meet: scrutability, auditability, and reproducibility across the complete data, code, and adaptation pipeline. Most medical LLMs labeled "open" release only their parameters, not the data, processes, and generation pipelines behind them, leaving researchers and clinicians unable to inspect or reproduce critical sources of behavioral bias, contamination, or misalignment. The absence of any medical LLM with full-stack transparency inhibits robust regulatory and clinical validation. Fully Open Meditron addresses this gap by introducing the first fully open (FO) pipeline for LLM-CDSS, differing from open-weight models in both process transparency and resource accessibility.

Figure 2: Openness taxonomy spanning closed, open-weight, partially open, and fully open medical LLMs according to released artifacts and provenance.

Auditable Corpus Construction and Data Management

Fully Open Meditron builds on a clinician-audited, open-source corpus that unites eight public medical QA datasets (MedQA, MedMCQA, PubMedQA, MedExpQA, HealthSearchQA, LiveQA, AfriMed-QA v1/v2), augmented by synthetic expansion along three axes: exam-style QA, guideline-derived QA (46,469 clinical guidelines from 16 global institutions), and open-ended vignettes from expert-written MOOVE scenarios. Hallucinations in the synthetic segments are mitigated by gold-label rejection sampling. Distributional and compositional analyses show that naïve aggregation of public benchmarks substantially underrepresents urgent, life-threatening, and low-resource clinical contexts. Clinician-guided synthetic extension (incorporating global health, emergency, pediatrics, and epidemiological context) increases the share of emergency-care coverage from 15.0% to 38.7% and of life-threatening cases from 8.6% to 31.8% in the relevant subsets, according to the content stratification studies. The pipeline enforces system-wide decontamination (n-gram and token alignment) against evaluation references to limit data leakage and contamination.

Figure 3: Pipeline for construction of the MeditronFO corpus integrating curated public QA, guidelines, expert vignettes, and synthetic QA with clinician audit and decontamination.
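
The n-gram side of the decontamination step can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the word-level tokenization, and the window size `n=8` are all assumptions chosen for illustration, and the token-alignment component is not shown.

```python
import re

def ngrams(text, n):
    """Return the set of word-level n-grams of a lowercased text (punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_examples, eval_references, n=8):
    """Drop every training example that shares at least one n-gram with an evaluation reference.

    n=8 is an illustrative assumption, not a value reported in the source.
    """
    blocked = set()
    for ref in eval_references:
        blocked |= ngrams(ref, n)
    return [ex for ex in train_examples if not (ngrams(ex, n) & blocked)]

# Hypothetical toy data: the first training item overlaps an evaluation reference.
train = [
    "A 54-year-old man presents with crushing chest pain radiating to the left arm",
    "What is the first-line treatment for uncomplicated malaria in children",
]
evaluation = [
    "A 54-year-old man presents with crushing chest pain radiating to the left arm and jaw.",
]
clean = decontaminate(train, evaluation)
```

Because the match is purely lexical, a paraphrased benchmark item would pass this filter, which is the residual risk the limitations section describes.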

Synthetic data makes up ∼64% of the final 601,519-example, 150M-token corpus. Zero-shot LLM classifiers confirm the resulting shifts in specialty, urgency, and difficulty distributions, supporting the pipeline's efficacy in closing prior coverage gaps.

Figure 4: Breakdown of component counts in the fully open Meditron datasets.
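
The gold-label rejection sampling used for the synthetic segments of this corpus can be sketched as follows. This is a hedged illustration only: `teacher`, `extract_choice`, the `Answer:` marker, and `max_attempts` are hypothetical names and conventions, not details taken from the paper.

```python
def extract_choice(generation):
    """Toy parser (assumed format): read the option letter after the last 'Answer:' marker."""
    marker = "Answer:"
    idx = generation.rfind(marker)
    if idx == -1:
        return None
    tail = generation[idx + len(marker):].strip()
    return tail[:1].upper() if tail else None

def gold_label_rejection_sample(question, gold_label, teacher, max_attempts=4):
    """Resample the teacher until its final answer matches the gold label.

    Returns the first accepted generation, or None if every attempt is rejected.
    `teacher` is any callable mapping a question string to a generation string.
    """
    for _ in range(max_attempts):
        generation = teacher(question)
        if extract_choice(generation) == gold_label:
            return generation
    return None

# Hypothetical stand-in teacher that answers wrongly once, then correctly.
_canned = iter([
    "Likely viral etiology. Answer: A",
    "Findings suggest bacterial meningitis. Answer: C",
])
accepted = gold_label_rejection_sample("Q?", "C", lambda q: next(_canned))
```

Only generations whose final answer agrees with the known label reach the corpus, so the explanation text is kept only when it supports the correct conclusion.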

Model Training, Instruction Alignment, and Evaluation Protocols

The corpus is used for supervised fine-tuning of five fully open base models spanning the Apertus, OLMo, and EuroLLM families across parameter scales from 8B to 70B. Training uses fully reproducible Axolotl configurations that preserve native chat templates and rely on bfloat16 mixed precision, ZeRO-3 or FSDP v2 sharding strategies, sample packing at sequence length 4096, and non-proprietary infrastructure.
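
To make the sample-packing idea concrete, here is a minimal sketch that groups tokenized examples into sequences of at most 4096 tokens using greedy first-fit-decreasing bin packing. It is an illustration of the general technique under that assumption, not Axolotl's actual packing algorithm; `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(lengths, max_len=4096):
    """Greedy first-fit-decreasing packing of example lengths into bins of size max_len.

    Returns a list of bins, each a list of indices into `lengths`.
    """
    bins = []  # each entry is [remaining_capacity, [example indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"example {i} has {n} tokens, exceeding max_len={max_len}")
        for b in bins:
            if b[0] >= n:  # first bin with enough room
                b[0] -= n
                b[1].append(i)
                break
        else:
            bins.append([max_len - n, [i]])  # no bin fits, so open a new one
    return [indices for _, indices in bins]

# Hypothetical token counts for five training examples.
lengths = [3000, 1000, 2000, 2000, 96]
packed = pack_sequences(lengths)
```

Packing reduces padding waste for short QA examples. A real training setup also has to mask attention so that packed examples do not attend to each other, which this sketch does not cover.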

To surpass the limitations of traditional multiple-choice clinical evaluation, two major open-ended evaluation strategies are employed:

  • HealthBench: Open-ended, rubric-driven scoring (multi-axis physician-authored).
  • Auto-MOOVE: LLM-as-a-judge framework measuring pairwise clinical preference and per-criterion Likert ratings (criteria include harm, fairness, contextual awareness; 9 axes in total). Judge validation against 204 human raters yields a Cohen's κ statistically indistinguishable from that of the median panelist.
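
For readers unfamiliar with the agreement statistic used in the judge validation, Cohen's κ compares observed agreement p_o with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch follows; the preference labels are made-up toy data, and nothing here reproduces the paper's analysis.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters always use one identical label
    return (p_o - p_e) / (1 - p_e)

# Toy pairwise preferences ("A" or "B" preferred) from a judge and one human rater.
judge = ["A", "A", "B", "B"]
human = ["A", "A", "B", "A"]
kappa = cohens_kappa(judge, human)
```

In this toy case the raters agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so κ is 0.5.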

Figure 5: Auto-MOOVE pairwise win rates for MeditronFO variants against base models and MedGemma, showing consistent model preferences.

Figure 6: Averaged per-criterion Likert profiles across nine axes for MeditronFO and corresponding base models, indicating multidimensional improvements.

Empirical and Comparative Results

Apertus-70B-MeditronFO establishes a new state of the art among fully open medical specialists, improving from 47.2% to 53.8% average accuracy (+6.6 points) across MedMCQA, MedQA, PubMedQA, MedXpertQA, and HealthBench. All MeditronFO-finetuned models show consistent gains both in accuracy (up to +12.8 absolute points for the smaller EuroLLM/Apertus bases) and in open-ended clinical preference (Auto-MOOVE adjusted win rates of 67%–92%). In cross-family evaluation, Gemma-3-27B-MeditronFO outperforms MedGemma on HealthBench (58% vs 55.9%) and is preferred in 58.6% of LLM-judge pairwise comparisons, contradicting the assumption that proprietary synthetic or undisclosed data pipelines are necessary for clinical SoTA.

Ablation studies on corpus components show that exam-style QA and clinical vignette data matter not only for MCQA metrics but also for open-ended preference and reasoning axes. Removing guideline QA slightly raises MCQA scores but does not improve open-interaction competence. Removing the synthetic MOOVE scenarios, in particular, significantly lowers model ratings on contextual and reasoning dimensions, which supports the need for comprehensive training coverage.

Robustness, Limitations, and Implications

Despite system-wide decontamination, the contamination filter is syntactic rather than semantic, which leaves a residual risk of paraphrased benchmark leakage. The single-teacher, single-judge setup introduces stylistic biases, though ablation across judge models confirms the robustness of the main claims. Auto-MOOVE, though validated against human raters, underperforms human discriminators, especially on safety-centric axes such as harmlessness and fairness, as corroborated by agreement studies.

Pragmatically, the fully open pipeline enables transparent, reproducible development and evaluation of LLM-CDSS, directly facilitating regulatory, clinical, and academic auditing. The work demonstrates that FO training does not inherently result in inferior clinical LLMs: with disciplined validation and corpus construction, it is feasible to match or approach proprietary, closed-data performance, particularly when model and compute scales are matched.

Figure 7: Timeline of HealthBench performance across closed, open-weight, and fully open medical LLMs, highlighting the closing gap due to MeditronFO models.

Future Directions

Potential avenues include integration of open-preference optimization, semantic decontamination, deeper clinician-in-the-loop supervision, and generalization to multilingual settings. Optimizing replay schedules to retain generalist instruction-following capability and deploying end-to-end FO teacher models to minimize the transfer of biases into synthetic data also remain to be pursued. These steps would further consolidate the auditability and global applicability of open medical LLMs in both research and practical deployments.

Conclusion

Fully Open Meditron demonstrates, with strong empirical support, that rigorous auditing, clinician-guided data construction, and transparent adaptation pipelines can yield clinically competitive medical LLMs without recourse to undisclosed or proprietary resources. The FO paradigm, as instantiated here, balances domain-specific metric improvement with critical requirements for auditable, reproducible, and safe clinical model deployment. The resources and process disclosed establish a reusable foundation for further development in open medical AI and set a new benchmark for transparency and scientific rigor in high-stakes LLM adaptations.

Figure 9: Distribution of per-rater κ among the human panel with Auto-MOOVE judge's agreement, validating LLM-judge equivalency for clinical reasoning preference evaluation.
