Meditron-70B: Open Medical Language Model
- Meditron-70B is a family of large language models specialized for medicine, built on a 70B parameter decoder-only transformer with advanced attention mechanisms.
- It leverages continued pretraining and fine-tuning on extensive, curated clinical datasets, achieving superior performance on benchmarks like MedQA and PubMedQA.
- The fully open and auditable pipeline ensures transparent model development and reproducibility, supporting clinical decision support systems and medical education.
Meditron-70B denotes a family of LLMs specialized for the medical domain, distinguished by their scale (70 billion parameters), continued pretraining and fine-tuning regimes, and, in the most recent instantiation, a fully open, auditable pipeline. Meditron-70B models are deployed both as open-weight domain-adapted transformers (Chen et al., 2023) and as the first clinical LLMs reproducible end-to-end from transparent data, code, and evaluation recipes (Theimer-Lienhard et al., 15 May 2026). These models have established new performance standards for open-source medical reasoning and clinical decision support systems (CDSS), rivaling or outperforming proprietary or partially closed models in multiple benchmark settings.
1. Underlying Architectures and Training Paradigms
Meditron-70B’s architecture has evolved over two principal phases:
Initial release (Chen et al., 2023): Meditron-70B is a continued-pretraining adaptation of Meta’s Llama-2-70B, a decoder-only transformer with rotary positional encodings, Grouped-Query Attention, and untied embedding/projection weights. Training uses an extension of NVIDIA’s Megatron-LM distributed trainer that supports these features (along with Falcon’s parallel-attention/MLP ordering for other model families) and integrates FlashAttention/FlashAttention-2 for efficient attention computation and large-scale sequence parallelism. Configuration includes:
- Parameter count: 70B, with 80 transformer layers, 64 attention heads (8 key-value heads under Grouped-Query Attention), 8,192 hidden size (SwiGLU activations, RMSNorm).
- Context window: 4,096 tokens.
- Optimizer: AdamW (β₁=0.9, β₂=0.95, ϵ=10⁻⁵, weight decay=0.1, norm clipping=1.0).
- Data and parallelism: global batch of 512 sequences (micro-batch of 2 sequences), distributed over 128 A100 GPUs (Tensor-Parallel=8, Pipeline-Parallel=8, Data-Parallel=2).
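The parallelism figures in this list can be cross-checked with a little arithmetic. The sketch below assumes that the batch sizes count sequences of 4,096 tokens and that the micro-batch applies per data-parallel replica, as is conventional in Megatron-style trainers; the function names are illustrative and not taken from the Meditron codebase.

```python
# Values from the configuration list above.
TENSOR_PARALLEL = 8
PIPELINE_PARALLEL = 8
DATA_PARALLEL = 2
GLOBAL_BATCH = 512     # assumption: counted in sequences
MICRO_BATCH = 2        # assumption: sequences per data-parallel replica
CONTEXT_LENGTH = 4096  # tokens per sequence


def world_size(tp: int, pp: int, dp: int) -> int:
    """Number of GPUs implied by a 3D-parallel layout."""
    return tp * pp * dp


def grad_accum_steps(global_batch: int, micro_batch: int, dp: int) -> int:
    """Micro-batches each replica must process per optimizer step."""
    per_pass = micro_batch * dp
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp")
    return global_batch // per_pass


gpus = world_size(TENSOR_PARALLEL, PIPELINE_PARALLEL, DATA_PARALLEL)
accum_steps = grad_accum_steps(GLOBAL_BATCH, MICRO_BATCH, DATA_PARALLEL)
tokens_per_step = GLOBAL_BATCH * CONTEXT_LENGTH

print(f"GPUs: {gpus}")
print(f"micro-batches per replica per step: {accum_steps}")
print(f"tokens per optimizer step: {tokens_per_step:,}")
```

Under these assumptions the layout accounts for exactly the 128 GPUs stated, and each optimizer step covers roughly 2.1M tokens.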
The pretraining objective is the standard autoregressive cross-entropy minimization over the domain-adapted vocabulary, $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$, with AdamW weight decay as the only regularization.
Fully Open MeditronFO (Theimer-Lienhard et al., 15 May 2026): The most recent Meditron-70B is built upon the fully open Apertus-70B-Instruct backbone (standard transformer blocks, 32-head self-attention, rotary position encodings) and retains its native instruction-chat template. No architectural alterations or adapters are introduced. Fine-tuning is performed via supervised learning on a clinician-audited corpus, optimizing negative log-likelihood only: $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{<t})$, where $x$ denotes the input prompt and $y$ the target sequence. No RLHF or policy gradients are applied.
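To make the supervised objective concrete, here is a minimal pure-Python sketch of a negative log-likelihood computed only over target tokens, with prompt positions masked out. The masking convention, the normalisation by the number of target tokens, and the function names are assumptions for illustration; the source does not specify how the released training code implements the loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one score vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_nll(logits, token_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    logits[t] is the score vector used to predict token_ids[t + 1].
    loss_mask[i] is 1 for target (assistant) tokens and 0 for prompt tokens.
    """
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        if loss_mask[t + 1]:
            total -= log_softmax(logits[t])[token_ids[t + 1]]
            count += 1
    if count == 0:
        raise ValueError("loss_mask selects no target tokens")
    return total / count


# Toy example: vocabulary of 4, uniform predictions, last two tokens are targets.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
toy_tokens = [0, 1, 2, 3]
toy_mask = [0, 0, 1, 1]
toy_loss = sft_nll(toy_logits, toy_tokens, toy_mask)
print(toy_loss)  # equals log(4) for a uniform model
```

Setting every mask entry to 1 recovers the plain autoregressive loss used for continued pretraining, up to the normalisation constant.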
2. Data Sources and Curation Protocols
Meditron-70B (Continued Pretraining) (Chen et al., 2023)
Pretraining is performed on the “GAP-Replay” mixture—a 48.1B-token corpus derived from:
- Clinical Guidelines: 46,469 curated global/national/institutional documents, cleaned and deduplicated.
- PubMed Abstracts: 16.2M biomedical abstracts, 5.48B tokens, with structured annotation for citations, figures, and formulas.
- PubMed Full-text Papers: 4.9M open-access articles, 40.7B tokens, deduplicated by PMCID.
- Experience Replay: 420M tokens (1%) of general-domain data (RedPajama), mitigating catastrophic forgetting.
The final mixture comprises 21.1M samples, 46.7B training tokens, and 1.4B validation tokens, with explicit splits and documentation.
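The token counts quoted for the mixture can be reconciled with a short calculation. The sketch below only uses figures stated in this section (in millions of tokens) and shows that the replay share of the full corpus is a little under the rounded 1% figure.

```python
# Token counts from this section, in millions of tokens.
TRAIN_TOKENS_M = 46_700
VALIDATION_TOKENS_M = 1_400
REPLAY_TOKENS_M = 420

total_tokens_m = TRAIN_TOKENS_M + VALIDATION_TOKENS_M
replay_share = REPLAY_TOKENS_M / total_tokens_m
validation_share = VALIDATION_TOKENS_M / total_tokens_m

print(f"total corpus: {total_tokens_m / 1000:.1f}B tokens")
print(f"replay share: {replay_share:.2%}")
print(f"validation share: {validation_share:.2%}")
```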
MeditronFO Corpus (Theimer-Lienhard et al., 15 May 2026)
The Fully Open Meditron corpus is constructed to enable full auditability and bitwise-reproducible training runs:
- Curated QA: 216,719 examples (43.9M tokens) across 8 public medical QA datasets (MedQA, MedMCQA, PubMedQA, MedExpQA, HealthSearchQA, LiveQA, AfriMed-QA v1/v2).
- Synthetic Curated QA: 214,654 MCQs/short-answer QAs generated by GPT-OSS-120B, vetted with clinician prompts.
- Guidelines QA: 145,681 MCQs grounded in 46,469 clinical guidelines.
- Synthetic MOOVE: 24,465 long-form vignettes for diagnostic reasoning, derived from MOOVE.
All data are normalized to a conversational (system/user/assistant) format, with stepwise rationales preserved. To control for benchmark leakage, a two-stage decontamination is enforced: n-gram overlap (8-gram exclusion) and token-alignment distance ≤ 0.5, referencing all evaluation sets.
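The first decontamination stage can be illustrated with a small sketch. It assumes whitespace tokenization and lowercasing, and drops any training example that shares at least one 8-gram with an evaluation item; the second stage (token-alignment distance) is not shown because the source does not define the distance precisely. All function names are hypothetical.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of word-level n-grams, using lowercase whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n: int = 8) -> set:
    """Union of all n-grams that occur in any evaluation item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(train_texts, benchmark_texts, n: int = 8):
    """Split training texts into (kept, removed) by n-gram overlap."""
    index = build_benchmark_index(benchmark_texts, n)
    kept, removed = [], []
    for text in train_texts:
        if ngrams(text, n) & index:
            removed.append(text)  # shares an n-gram with an evaluation item
        else:
            kept.append(text)
    return kept, removed


benchmark = ["A 45 year old man presents with crushing chest pain radiating to the left arm"]
train = [
    "Case: a 45 year old man presents with crushing chest pain radiating to the jaw",
    "Metformin is a first line therapy for type 2 diabetes in most adult patients",
]
kept, removed = decontaminate(train, benchmark)
print(len(kept), len(removed))
```

Returning the removed examples alongside the kept ones makes it straightforward to log why each sample was excluded.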
Synthetic generations are further checked via physician-panel auditing and up to eight-fold rejection sampling at ; only gold-label-matched outputs are retained.
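The gold-label filter described here amounts to a bounded rejection-sampling loop. The sketch below assumes a caller-supplied `generate` function that returns a dictionary with a `label` field; that interface is hypothetical and stands in for whatever teacher-model call the pipeline actually uses.

```python
def rejection_sample(generate, gold_label, max_attempts: int = 8):
    """Draw up to max_attempts candidates and keep the first one whose
    predicted label matches the gold label.

    Returns (candidate, attempts_used), or (None, max_attempts) if every
    draw disagreed with the gold label and the example is discarded.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if candidate["label"] == gold_label:
            return candidate, attempt
    return None, max_attempts


# Deterministic stand-in for a teacher model, for illustration only.
def make_scripted_generator(labels):
    it = iter(labels)
    return lambda: {"label": next(it), "rationale": "..."}


accepted, used = rejection_sample(make_scripted_generator(["A", "B", "C"]), gold_label="C")
rejected, _ = rejection_sample(make_scripted_generator(["A"] * 8), gold_label="C")
print(accepted["label"], used, rejected)
```

Recording the number of attempts per example is one way to keep the rejection history traceable.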
3. Training Regimes and Hyperparameters
Meditron-70B (Chen et al., 2023)
Pretraining utilizes:
- Global throughput: ~40,200 tokens/sec, ~42.3% GPU-flop utilization.
- Compute budget: ~42,500 GPU-hours (128 GPUs × 332 hours).
- Learning schedule: cosine decay with 2,000-step warmup (peak LR , floor ).
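A cosine schedule with warmup of the kind listed above can be sketched as follows. Linear warmup is an assumption, and `peak_lr`, `floor_lr`, and `total_steps` are left as parameters because their values are not given in this text; the numbers in the example call are placeholders, not the values used for Meditron-70B.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, floor_lr: float,
          warmup_steps: int = 2000) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup to peak_lr over warmup_steps (assumed), then cosine
    decay from peak_lr down to floor_lr at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values chosen only to exercise the function.
for s in (0, 1999, 2000, 6000, 10000):
    print(s, lr_at(s, total_steps=10000, peak_lr=1.0, floor_lr=0.1))
```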
Apertus-70B-MeditronFO (Theimer-Lienhard et al., 15 May 2026)
Finetuning settings:
- Hardware: 32 GH200 GPUs (8 nodes × 4 GPUs), micro-batch=4 sequences/GPU.
- Effective batch: 128 sequences.
- Peak LR: , cosine decay, 10% warmup, no weight decay, β₂=0.999.
- Parallelization: DeepSpeed ZeRO-3, bfloat16, FlashAttention2.
- Wall-clock: 6h 39m (~213 GPU-hr) to cover ~150M tokens; configurations, random seeds, and logs are fully released for reproducibility.
Corpus segments are simply concatenated, with no per-component weighting; an optional Tülu replay (10% proportion) can be added to preserve general-domain capabilities.
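The fine-tuning budget can be sanity-checked from the listed settings. The sketch assumes pure data parallelism under ZeRO-3 with no gradient accumulation, which is what the stated effective batch of 128 implies; the throughput figure is derived from the approximate token count and is only indicative.

```python
NUM_GPUS = 32
MICRO_BATCH_PER_GPU = 4      # sequences
GRAD_ACCUM_STEPS = 1         # assumption: implied by the effective batch of 128
WALL_CLOCK_HOURS = 6 + 39 / 60
TOKENS_SEEN = 150e6          # approximate, as stated above

effective_batch = NUM_GPUS * MICRO_BATCH_PER_GPU * GRAD_ACCUM_STEPS
gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS
tokens_per_gpu_hour = TOKENS_SEEN / gpu_hours

print(f"effective batch: {effective_batch} sequences")
print(f"GPU-hours: {gpu_hours:.1f}")
print(f"tokens per GPU-hour: {tokens_per_gpu_hour:,.0f}")
```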
4. Evaluation Methodology and Benchmarks
Structured MCQA Tasks (Meditron-70B & FO):
- MedQA (USMLE-style), MedMCQA (Indian medical entrance), PubMedQA (biomedical research questions), and MMLU-Medical (broad medical subdisciplines; Meditron-70B only).
- Metrics: Raw accuracy; for “Med Avg,” unweighted mean across tasks.
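As a small illustration of the metric definition, the sketch below computes raw accuracy per task and then the unweighted mean across tasks. The task names and predictions are toy data, not results from the papers; the point is that an unweighted mean treats each benchmark equally regardless of its size, so it can differ from accuracy pooled over all questions.

```python
def accuracy(predictions, gold) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def med_avg(per_task_accuracy: dict) -> float:
    """Unweighted mean of per-task accuracies."""
    return sum(per_task_accuracy.values()) / len(per_task_accuracy)


# Toy data for illustration only.
toy_tasks = {
    "task_a": (["A", "B", "C", "D"], ["A", "B", "C", "A"]),
    "task_b": (["yes", "no"], ["no", "no"]),
}
scores = {name: accuracy(pred, gold) for name, (pred, gold) in toy_tasks.items()}
average = med_avg(scores)
pooled = sum(p == g for pred, gold in toy_tasks.values()
             for p, g in zip(pred, gold)) / sum(len(g) for _, g in toy_tasks.values())
print(scores, average, pooled)
```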
Few-shot and In-context:
- 70B models are evaluated 5-shot and 7B models 3-shot. Meditron-70B achieves 63.3% average few-shot accuracy (vs. 60.8% for Llama-2-70B), with gains of +7.0%, +3.0%, and +0.9% on PubMedQA, MedQA, and MedMCQA respectively.
Supervised Finetuning Modes:
- Inference with Top-Token, zero-shot Chain-of-Thought (CoT), Self-Consistency CoT (SC-CoT).
- Meditron-70B achieves 72.0% (SC-CoT) avg. accuracy (Top-Token: 69.0%), outperforming Llama-2-70B across all modes; MedQA-4-option SC-CoT: 70.2% (human passing ≈60%).
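Self-consistency decoding aggregates several sampled chain-of-thought completions by majority vote over their final answers. The sketch below shows only the voting step and assumes the answers have already been extracted from each sampled completion, with `None` marking completions where no answer could be parsed; the number of samples and the extraction method are not specified here.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none.

    Ties are resolved in favour of the answer that appeared first.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


sampled_answers = ["C", "A", "C", None, "B"]
final_answer = self_consistency_vote(sampled_answers)
print(final_answer)  # "C"
```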
Open-ended Rubric Evaluation:
- HealthBench: 1,000 physician-authored clinical conversations, LLM judged (Qwen3-235B-A22B).
- Auto-MOOVE: 24,681 vignettes; base and FO model responses are compared in randomized order by an LLM-as-a-judge, with 9-axis Likert ratings.
- LLM judge calibration: inter-rater κ=0.232 (tied) vs. human mean κ=0.320.
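For readers unfamiliar with agreement statistics, the sketch below computes Cohen's kappa for two raters. This is an assumption made for illustration: the text does not say which kappa variant was used or how ties were handled, and the toy labels are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and aligned")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy preference labels from a judge and a human, for illustration only.
judge = ["A", "A", "B", "B"]
human = ["A", "B", "B", "B"]
kappa = cohen_kappa(judge, human)
print(kappa)  # 0.5
```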
Error Analysis:
- Few-shot: std. dev. across three seeds (±0.5–7.3%). No formal p-values.
Representative Example (MedQA):
Q: “Which ultrasound finding has highest aneuploidy association?” Meditron-70B (finetuned SC-CoT): “Cystic hygroma (C)” (correct).
5. Comparative Performance and State-of-the-Art Position
Meditron-70B sets new state-of-the-art among open-weight LLMs for medicine:
- Few-shot: Outperforms Llama-2-70B by 2.5 percentage points average; largest PubMedQA gain (+7.0%).
- Supervised: Meditron-70B surpasses Llama-2-70B by 2.8 percentage points (SC-CoT), generally exceeds GPT-3.5 (175B) across the benchmarks, and comes within 5% of GPT-4 and within 10% of Med-PaLM-2 (540B).
- Fully Open Meditron (Apertus-70B-MeditronFO):
  - Med Avg accuracy: 53.77% (+6.59pp over Apertus-70B-Instruct base).
  - Robust gains on PubMedQA (66.8→75.2%), MedQA (60.6→68.6%), MedMCQA (52.4→56.3%), OOD generalization (MedXpertQA: 12.3→16.9%), and HealthBench (43.7→51.9%).
  - Auto-MOOVE adjusted win rate: 79.6% (FO vs. base), ΔLikert +0.40.
- Broader Comparisons: OLMo-2-32B-MeditronFO: +1.7pp (51.5→53.2%) Med Avg; EuroLLM-22B-MeditronFO +0.66pp; Gemma-3-27B-MeditronFO is preferred over MedGemma-27B in 58.6% of pairwise evaluations.
Open-source competitors (e.g., Clinical-Camel, Med42) rely on parameter-efficient adaptation; Meditron employs full-parameter continued pretraining and fully auditable fine-tuning.
6. Auditability, Transparency, and Reusability
MeditronFO establishes the first “fully open” pipeline in clinical LLM research (Theimer-Lienhard et al., 15 May 2026):
- All code (data prep, decontamination, training, evaluation), corpus provenance (public QA+guidelines, synthetic template seeds, teacher model configs), and hardware logs are under open license.
- Decontamination logs justify sample removal.
- Synthetic examples are traceable to prompts, seed exemplars, and teacher outputs (with full rejection sampling history).
- Evaluation artifacts (Auto-MOOVE, LLM judgments) are public for human reappraisal.
- Run configurations (including Axolotl/SLURM scripts), random seeds, and hardware specs support bit-for-bit re-execution.
This enables transparency at all stages—from data curation through model training to evaluation—addressing regulatory and clinical trust requirements.
7. Limitations, Clinical Implications, and Future Directions
Meditron-70B models are suited for:
- Clinical decision support: evidence retrieval, guideline summarization, and differential diagnosis assistance.
- Medical education: automated assessment, question generation, and case analysis.
Weaknesses:
- Residual hallucinations and citation errors, partially mitigated by data formatting and gold-label resampling.
- Informal safety appraisals indicate improved, but imperfect, refusal on prompts for self-harm/illicit advice.
- Not intended or certified for unsupervised clinical deployment; further alignment, RCT-level validation, and human moderation are required.
Research Directions:
- Instruction tuning, retrieval-augmented and multimodal extension (e.g., integration with medical imaging), bias/safety audits, and open benchmarking with physician-in-the-loop validation.
Significance:
Meditron-70B demonstrates that fully auditable LLM specialization is feasible at state-of-the-art scale, and that open science can deliver both technical excellence and transparency. The Meditron-70B paradigm positions regulatory, institutional, and patient stakeholders to independently verify and adapt LLM-CDSS solutions, setting a blueprint for responsible and reproducible medical AI (Chen et al., 2023, Theimer-Lienhard et al., 15 May 2026).