
Meditron-70B: Open Medical Language Model

Updated 29 May 2026
  • Meditron-70B is a family of large language models specialized for medicine, built on a 70B parameter decoder-only transformer with advanced attention mechanisms.
  • It leverages continued pretraining and fine-tuning on extensive, curated clinical datasets, achieving superior performance on benchmarks like MedQA and PubMedQA.
  • The fully open and auditable pipeline ensures transparent model development and reproducibility, supporting clinical decision support systems and medical education.

Meditron-70B denotes a family of LLMs specialized for the medical domain, distinguished by their scale (70 billion parameters), continued pretraining and fine-tuning regimes, and, in the most recent instantiation, a fully open, auditable pipeline. Meditron-70B models are deployed both as open-weight domain-adapted transformers (Chen et al., 2023) and as the first clinical LLMs reproducible end-to-end from transparent data, code, and evaluation recipes (Theimer-Lienhard et al., 15 May 2026). These models have established new performance standards for open-source medical reasoning and clinical decision support systems (CDSS), rivaling or outperforming proprietary or partially closed models in multiple benchmark settings.

1. Underlying Architectures and Training Paradigms

Meditron-70B’s architecture has evolved over two principal phases:

Initial release (Chen et al., 2023): Meditron-70B is a continued-pretraining adaptation of Meta’s Llama-2-70B, a decoder-only transformer with rotary positional encodings and Grouped-Query Attention. Training uses an extension of NVIDIA’s Megatron-LM distributed trainer that supports these features, along with untied embedding/projection weights and Falcon’s parallel-attention/MLP ordering, and integrates FlashAttention/FlashAttention-2 for efficient attention computation and large-scale sequence parallelism. Configuration includes:

  • Parameter count: 70B, with 80 transformer layers, 64 attention heads, and a hidden size of 8,192 (SwiGLU activations, RMSNorm), following the Llama-2-70B configuration.
  • Context window: 4,096 tokens.
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ϵ=10⁻⁵, weight decay=0.1, norm clipping=1.0).
  • Data and parallelism: global batch of 512 sequences (micro-batch of 2 sequences), distributed over 128 A100 GPUs (Tensor-Parallel=8, Pipeline-Parallel=8, Data-Parallel=2).
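The parallelism figures in this list can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the batch figures count sequences and that the micro-batch is per data-parallel replica, neither of which the source states unambiguously, and the helper names are hypothetical.

```python
# Values quoted in the configuration list. The batch figures are ASSUMED to
# count sequences, and the micro-batch is ASSUMED to be per model replica.
TENSOR_PARALLEL = 8
PIPELINE_PARALLEL = 8
DATA_PARALLEL = 2
GLOBAL_BATCH = 512
MICRO_BATCH = 2
CONTEXT_TOKENS = 4096


def gpu_count(tp: int, pp: int, dp: int) -> int:
    """Each model replica spans tp * pp GPUs; dp replicas run side by side."""
    return tp * pp * dp


def micro_batches_per_step(global_batch: int, micro_batch: int, dp: int) -> int:
    """Micro-batches each replica processes before one optimizer step."""
    per_pass = micro_batch * dp
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp")
    return global_batch // per_pass


gpus = gpu_count(TENSOR_PARALLEL, PIPELINE_PARALLEL, DATA_PARALLEL)
accumulation = micro_batches_per_step(GLOBAL_BATCH, MICRO_BATCH, DATA_PARALLEL)
# Upper bound on tokens per optimizer step, reached when every sequence is full length.
tokens_per_step = GLOBAL_BATCH * CONTEXT_TOKENS

print(f"GPUs: {gpus}")
print(f"micro-batches per replica per step: {accumulation}")
print(f"max tokens per step: {tokens_per_step}")
```

Under these assumptions, the product of the three parallelism degrees reproduces the 128 GPUs quoted in the list.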

The pretraining objective is the standard autoregressive cross-entropy minimization over the domain-adapted vocabulary: $\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i})$, with AdamW weight decay as the only regularization.
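As a minimal sketch of the autoregressive objective, the snippet below sums the negative log-probability of each target token given per-position logits. It uses only the standard library; the function names and toy inputs are hypothetical, and a real training run would compute this in a tensor framework rather than in pure Python.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def autoregressive_nll(step_logits, targets):
    """Sum over positions i of -log p(w_i | w_<i).

    step_logits[i] holds the model's logits for position i (already
    conditioned on the preceding tokens); targets[i] is the index of w_i.
    """
    if len(step_logits) != len(targets):
        raise ValueError("need exactly one target per position")
    loss = 0.0
    for logits, target in zip(step_logits, targets):
        loss -= log_softmax(logits)[target]
    return loss


# Toy check: uniform logits over a 4-token vocabulary give log(4) per token.
toy_logits = [[0.0] * 4 for _ in range(3)]
toy_targets = [1, 3, 0]
toy_loss = autoregressive_nll(toy_logits, toy_targets)
print(toy_loss, 3 * math.log(4))
```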

Fully Open MeditronFO (Theimer-Lienhard et al., 15 May 2026): The most recent Meditron-70B is built upon the fully open Apertus-70B-Instruct backbone (standard transformer blocks, 32-head self-attention, rotary position encodings), retaining its native instruction-chat template. No architectural alterations or adapters are introduced. Fine-tuning is performed via supervised learning on a clinician-audited corpus, optimizing negative log-likelihood only: $\mathcal{L}(\theta) = -\sum_{i=1}^{n} \sum_{t=1}^{T_i} \log p_\theta(y_{i,t} \mid y_{i,<t}, x_i)$, where $x_i$ denotes the input prompt and $y_i$ the target sequence. No RLHF or policy gradients are applied.
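In the supervised objective, the prompt only conditions the prediction and contributes no loss terms of its own. A minimal sketch of that masking follows, assuming each example arrives as a prompt length plus per-token log-probabilities for the concatenated prompt and response. This data layout and the function name are hypothetical, not the released training code.

```python
def sft_nll(examples):
    """Negative log-likelihood summed over response tokens only.

    Each example is (prompt_len, token_logprobs), where token_logprobs covers
    the prompt followed by the response. Prompt positions are skipped, which
    mirrors summing over t = 1..T_i of the target sequence only.
    """
    total = 0.0
    for prompt_len, token_logprobs in examples:
        response_logprobs = token_logprobs[prompt_len:]
        total -= sum(response_logprobs)
    return total


# Toy batch: the large negative values sit at prompt positions and are ignored.
toy_batch = [
    (2, [-9.0, -9.0, -0.5, -0.25]),  # 2 prompt tokens, 2 response tokens
    (1, [-3.0, -1.0]),               # 1 prompt token, 1 response token
]
toy_sft_loss = sft_nll(toy_batch)
print(toy_sft_loss)
```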

2. Data Sources and Curation Protocols

Pretraining is performed on the “GAP-Replay” mixture—a 48.1B-token corpus derived from:

  • Clinical Guidelines: 46,469 curated global/national/institutional documents, cleaned and deduplicated.
  • PubMed Abstracts: 16.2M biomedical abstracts, 5.48B tokens, with structured annotation for citations, figures, and formulas.
  • PubMed Full-text Papers: 4.9M open-access articles, 40.7B tokens, deduplicated by PMCID.
  • Experience Replay: 420M tokens (1%) of general-domain data (RedPajama), mitigating catastrophic forgetting.

The final mixture comprises 21.1M samples, 46.7B training tokens, and 1.4B validation tokens, with explicit splits and documentation.

The Fully Open Meditron corpus is constructed to enable full auditability and bitwise-reproducible training runs:

  • Curated QA: 216,719 examples (43.9M tokens) across 8 public medical QA datasets (MedQA, MedMCQA, PubMedQA, MedExpQA, HealthSearchQA, LiveQA, AfriMed-QA v1/v2).
  • Synthetic Curated QA: 214,654 MCQs/short-answer QAs generated by GPT-OSS-120B, vetted with clinician prompts.
  • Guidelines QA: 145,681 MCQs grounded in 46,469 clinical guidelines.
  • Synthetic MOOVE: 24,465 long-form vignettes for diagnostic reasoning, derived from MOOVE.

All data are normalized to a conversational (system/user/assistant) format, with stepwise rationales preserved. To control for benchmark leakage, a two-stage decontamination is enforced: n-gram overlap (8-gram exclusion) and token-alignment distance ≤ 0.5, referencing all evaluation sets.
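The two-stage decontamination described above can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization, "token-alignment distance" interpreted as token-level Levenshtein distance normalized by the longer sequence, and a sample being dropped when that distance is at or below 0.5. The source does not define these details, and all function names are hypothetical.

```python
def ngrams(tokens, n=8):
    """Set of contiguous n-grams; empty when the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def token_edit_distance(a, b):
    """Levenshtein distance computed over tokens rather than characters."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                      # deletion
                           cur[j - 1] + 1,                   # insertion
                           prev[j - 1] + (tok_a != tok_b)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """ASSUMED definition: edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    return token_edit_distance(a, b) / longest if longest else 0.0


def is_contaminated(sample, benchmark_items, n=8, max_distance=0.5):
    """Stage 1: shared n-gram. Stage 2: near-duplicate by normalized distance."""
    sample_tokens = sample.lower().split()
    sample_grams = ngrams(sample_tokens, n)
    for item in benchmark_items:
        item_tokens = item.lower().split()
        if sample_grams & ngrams(item_tokens, n):
            return True
        if normalized_distance(sample_tokens, item_tokens) <= max_distance:
            return True
    return False


benchmark = ["a patient presents with fever cough and chest pain for three days"]
print(is_contaminated("unrelated text about transformer training schedules", benchmark))
```

Running both stages catches verbatim overlaps as well as lightly edited paraphrases that break every 8-gram.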

Synthetic generations are further checked via physician-panel auditing and up to eight-fold rejection sampling at T=0.7; only gold-label-matched outputs are retained.
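The rejection-sampling filter can be sketched as a bounded retry loop that keeps the first generation whose final answer matches the gold label. The `generate` callable and the scripted stub below are hypothetical stand-ins for the teacher model, whose actual interface is not given in the source.

```python
def rejection_sample(generate, gold_label, max_attempts=8, temperature=0.7):
    """Return the first generation whose label matches gold_label, else None.

    `generate` is a hypothetical callable taking a temperature and returning
    a (rationale, label) pair.
    """
    for attempt in range(1, max_attempts + 1):
        rationale, label = generate(temperature)
        if label == gold_label:
            return {"rationale": rationale, "label": label, "attempts": attempt}
    return None  # every attempt disagreed with the gold label: drop the item


def scripted_teacher(labels):
    """Hypothetical teacher stub that replays a fixed list of labels."""
    remaining = iter(labels)

    def generate(temperature):
        label = next(remaining)
        return f"stub rationale sampled at T={temperature}", label

    return generate


kept = rejection_sample(scripted_teacher(["A", "C", "B"]), gold_label="B")
dropped = rejection_sample(scripted_teacher(["A"] * 8), gold_label="B")
print(kept, dropped)
```

Recording the attempt count alongside each kept sample is one way to preserve the rejection history for later audit.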

3. Training Regimes and Hyperparameters

Pretraining utilizes:

  • Global throughput: ~40,200 tokens/sec, ~42.3% GPU-flop utilization.
  • Compute budget: ~42,500 GPU-hours (128 GPUs × 332 hours).
  • Learning schedule: cosine decay with 2,000-step warmup (peak LR 1.5×10⁻⁴, floor 1.6×10⁻⁵).
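The learning-rate schedule in this list can be sketched as below. The peak, floor, and warmup length come from the list. The linear warmup shape and the `total_steps` argument are assumptions, since the source gives neither the warmup form nor the total number of optimizer steps.

```python
import math

PEAK_LR = 1.5e-4
MIN_LR = 1.6e-5
WARMUP_STEPS = 2000


def learning_rate(step, total_steps, peak=PEAK_LR, floor=MIN_LR, warmup=WARMUP_STEPS):
    """Linear warmup (ASSUMED shape) followed by cosine decay to the floor."""
    if step < warmup:
        return peak * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# total_steps=10_000 is a hypothetical value chosen only for illustration.
for step in (0, 1000, 2000, 6000, 10_000):
    print(step, learning_rate(step, total_steps=10_000))
```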

Finetuning settings:

  • Hardware: 32 GH200 GPUs (8 nodes × 4 GPUs per node), micro-batch=4 sequences/GPU.
  • Effective batch: 128 sequences.
  • Peak LR: 1×10⁻⁵, cosine decay, 10% warmup, no weight decay, β₂=0.999.
  • Parallelization: DeepSpeed ZeRO-3, bfloat16, FlashAttention2.
  • Wall-clock: 6h 39m (~213 GPU-hr) to cover ~150M tokens; configurations, random seeds, and logs are fully released for reproducibility.
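The compute figures quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers stated above. The gradient-accumulation factor of 1 is an assumption inferred from the effective batch size, and the helper names are hypothetical.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total accelerator time consumed by a run."""
    return num_gpus * wall_clock_hours


# Pretraining: 128 GPUs for 332 hours at ~40,200 tokens/sec global throughput.
pretrain_gpu_hours = gpu_hours(128, 332)
pretrain_tokens = 40_200 * 332 * 3600  # tokens processed at the quoted throughput

# Finetuning: 32 GPUs for 6h 39m, micro-batch of 4 sequences per GPU.
finetune_wall_clock = 6 + 39 / 60
finetune_gpu_hours = gpu_hours(32, finetune_wall_clock)
GRAD_ACCUMULATION = 1  # ASSUMED: not stated, but consistent with the batch of 128
effective_batch = 32 * 4 * GRAD_ACCUMULATION

print(f"pretraining GPU-hours: {pretrain_gpu_hours}")
print(f"pretraining tokens at quoted throughput: {pretrain_tokens / 1e9:.1f}B")
print(f"finetuning GPU-hours: {finetune_gpu_hours:.1f}")
print(f"finetuning effective batch: {effective_batch}")
```

The pretraining product comes to 42,496 GPU-hours, consistent with the rounded ~42,500 figure, and the finetuning product rounds to the quoted ~213 GPU-hours.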

The corpus segments are concatenated without any per-component weighting. An optional Tülu replay (10% proportion) can be added to supplement general-domain capabilities.

4. Evaluation Methodology and Benchmarks

Structured MCQA Tasks (Meditron-70B & FO):

  • MedQA (USMLE-style), MedMCQA (Indian medical entrance), PubMedQA (biomedical research questions), and MMLU-Medical (broad medical subdisciplines; Meditron-70B only).
  • Metrics: Raw accuracy; for “Med Avg,” unweighted mean across tasks.
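A minimal sketch of the two metrics follows: per-task raw accuracy, then an unweighted mean across tasks. The task names and toy predictions are hypothetical placeholders, not results from the papers.

```python
def accuracy(predictions, gold):
    """Fraction of items whose predicted label equals the gold label."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def unweighted_mean(per_task_accuracy):
    """Average of task accuracies, each task counted once regardless of size."""
    return sum(per_task_accuracy.values()) / len(per_task_accuracy)


# Hypothetical toy tasks of different sizes.
toy_scores = {
    "task_a": accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]),  # 3 of 4 correct
    "task_b": accuracy(["yes", "no"], ["yes", "yes"]),               # 1 of 2 correct
}
toy_avg = unweighted_mean(toy_scores)
print(toy_scores, toy_avg)
```

Because the mean is unweighted, a small benchmark influences the average as much as a large one. Pooling all six toy items would instead give 4/6.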

Few-shot and In-context:

  • The 70B models are evaluated 5-shot and the 7B models 3-shot. Meditron-70B averages 63.3% few-shot accuracy (vs. 60.8% for Llama-2-70B), with gains of +7.0, +3.0, and +0.9 percentage points on PubMedQA, MedQA, and MedMCQA respectively.

Supervised Finetuning Modes:

  • Inference with Top-Token, zero-shot Chain-of-Thought (CoT), Self-Consistency CoT (SC-CoT).
  • Meditron-70B achieves 72.0% (SC-CoT) avg. accuracy (Top-Token: 69.0%), outperforming Llama-2-70B across all modes; MedQA-4-option SC-CoT: 70.2% (human passing ≈60%).
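Self-consistency decoding samples several chain-of-thought completions and returns the most frequent final answer. The sketch below shows only the voting step. Answer extraction from the generated text and the tie-breaking rule (earliest sampled answer wins) are assumptions, as the source does not specify them.

```python
from collections import Counter


def self_consistency_answer(sampled_answers):
    """Majority vote over final answers from independently sampled CoT runs.

    None entries stand for samples where no answer could be parsed.
    Ties are broken in favour of the answer that was sampled first (ASSUMED).
    """
    votes = Counter(a for a in sampled_answers if a is not None)
    if not votes:
        return None
    top = max(votes.values())
    for answer in sampled_answers:
        if answer is not None and votes[answer] == top:
            return answer


print(self_consistency_answer(["C", "A", "C", None, "B", "C"]))
```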

Open-ended Rubric Evaluation:

  • HealthBench: 1,000 physician-authored clinical conversations, LLM judged (Qwen3-235B-A22B).
  • Auto-MOOVE: 24,681 vignettes; model responses scored base vs. FO via randomization, LLM-as-a-judge, and 9-axis Likert ratings.
  • LLM judge calibration: inter-rater κ=0.232 (tied) vs. human mean κ=0.320.
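Agreement statistics such as the κ values above correct raw agreement for the agreement expected by chance. The sketch below computes Cohen's κ for two raters. Whether the papers use Cohen's κ or another variant is not stated in the source, so this is an assumption, and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("ratings must be non-empty and aligned")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    labels = set(counts_1) | set(counts_2)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_1[label] * counts_2[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pairwise preferences from a judge model and a human rater.
judge = ["A", "A", "B", "B"]
human = ["A", "B", "B", "B"]
toy_kappa = cohens_kappa(judge, human)
print(toy_kappa)
```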

Error Analysis:

  • Few-shot: std. dev. across three seeds (±0.5–7.3%). No formal p-values.

Representative Example (MedQA):

Q: “Which ultrasound finding has highest aneuploidy association?” Meditron-70B (finetuned SC-CoT): “Cystic hygroma (C)” (correct).

5. Comparative Performance and State-of-the-Art Position

Meditron-70B sets new state-of-the-art among open-weight LLMs for medicine:

  • Few-shot: Outperforms Llama-2-70B by 2.5 percentage points on average; the largest gain is on PubMedQA (+7.0 percentage points).
  • Supervised: Meditron-70B surpasses Llama-2-70B by 2.8 percentage points (SC-CoT), generally exceeds GPT-3.5 (175B) across the benchmarks, and comes within 5% of GPT-4 and within 10% of Med-PaLM-2 (540B).
  • Fully Open Meditron (Apertus-70B-MeditronFO):
    • Med Avg accuracy: 53.77% (+6.59pp over Apertus-70B-Instruct base).
    • Robust gain on PubMedQA (66.8→75.2%), MedQA (60.6→68.6%), MedMCQA (52.4→56.3%), OOD generalization (MedXpertQA: 12.3→16.9%), HealthBench (43.7→51.9%).
    • Auto-MOOVE adjusted win rate: 79.6% (FO vs. base), ΔLikert +0.40.
  • Broader Comparisons: OLMo-2-32B-MeditronFO: +1.7pp (51.5→53.2%) Med Avg; EuroLLM-22B-MeditronFO +0.66pp; Gemma-3-27B-MeditronFO is preferred over MedGemma-27B in 58.6% of pairwise evaluations.

Open-source competitors (e.g., Clinical-Camel, Med42) rely on parameter-efficient adaptation; Meditron employs full-parameter continued pretraining and fully auditable fine-tuning.

6. Auditability, Transparency, and Reusability

MeditronFO establishes the first “fully open” pipeline in clinical LLM research (Theimer-Lienhard et al., 15 May 2026):

  • All code (data prep, decontamination, training, evaluation), corpus provenance (public QA+guidelines, synthetic template seeds, teacher model configs), and hardware logs are under open license.
  • Decontamination logs justify sample removal.
  • Synthetic examples are traceable to prompts, seed exemplars, and teacher outputs (with full rejection sampling history).
  • Evaluation artifacts (Auto-MOOVE, LLM judgments) public for human reappraisal.
  • Run configurations (including Axolotl/SLURM scripts), random seeds, and hardware specs support bit-for-bit re-execution.

This enables transparency at all stages—from data curation through model training to evaluation—addressing regulatory and clinical trust requirements.

7. Limitations, Clinical Implications, and Future Directions

Meditron-70B models are suited for:

  • Clinical decision support: evidence retrieval, guideline summarization, and differential diagnosis assistance.
  • Medical education: automated assessment, question generation, and case analysis.

Weaknesses:

  • Residual hallucinations and citation errors, partially mitigated by data formatting and gold-label resampling.
  • Informal safety appraisals indicate improved, but imperfect, refusal on prompts for self-harm/illicit advice.
  • Not intended or certified for unsupervised clinical deployment; further alignment, RCT-level validation, and human moderation are required.

Research Directions:

  • Instruction tuning, retrieval-augmented and multimodal extension (e.g., integration with medical imaging), bias/safety audits, and open benchmarking with physician-in-the-loop validation.

Significance:

Meditron-70B demonstrates that fully auditable LLM specialization is feasible at state-of-the-art scale, and that open science can deliver both technical excellence and transparency. The Meditron-70B paradigm positions regulatory, institutional, and patient stakeholders to independently verify and adapt LLM-CDSS solutions, setting a blueprint for responsible and reproducible medical AI (Chen et al., 2023, Theimer-Lienhard et al., 15 May 2026).
