Meditron-70B: Open Medical LLM
- Meditron-70B is a 70-billion-parameter open-source large language model tailored for medicine, leveraging curated biomedical corpora for its pretraining.
- It uses a decoder-only Transformer architecture with 80 layers, rotary positional embeddings, and FlashAttention, scaled across 128 Nvidia A100 GPUs.
- Empirical evaluations show improved performance on general medical QA benchmarks, though it faces challenges in complex neurological clinical reasoning.
Meditron-70B is a 70-billion-parameter open-source LLM domain-specialized for medicine through continued pretraining on curated biomedical corpora. Developed at EPFL in collaboration with Yale and released under an open-source license, Meditron-70B extends the Llama-2 architecture and is notable both for its technical scale within the open-source domain and for its empirical evaluation on a variety of medical question answering tasks, including specialty benchmarks in neurology. Meditron-70B is designed to democratize access to high-capability medical LLMs and serves as a reference point for the limits and opportunities of domain-adapted large models (Chen et al., 2023, Sorka et al., 10 Aug 2025).
1. Model Architecture and Scaling
Meditron-70B is a decoder-only Transformer following the architecture of Llama-2-70B. It comprises 80 transformer layers with a hidden size of 8,192, 64 attention heads per layer, and approximately 70 billion parameters. Key architectural features inherited from Llama-2 include rotary positional embeddings, SwiGLU activation functions, RMSNorm for normalization, grouped-query attention (GQA) for efficient scaling, and FlashAttention/FlashAttention-2 for optimized attention kernel computation. In its Ollama deployment, Meditron-70B is served with a 128,000-token context window and outputs up to 2,000 tokens per call (Chen et al., 2023, Sorka et al., 10 Aug 2025).
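The architectural dimensions above can be collected into a small configuration sketch. This is a hypothetical, illustrative structure (not the actual Meditron codebase); the GQA key/value-head count of 8 is the standard Llama-2-70B setting, assumed here since the model inherits that architecture.

```python
from dataclasses import dataclass

# Illustrative Llama-2-70B-style configuration that Meditron-70B inherits.
# Field names are hypothetical; values follow the dimensions quoted in the text.
@dataclass
class MeditronConfig:
    n_layers: int = 80       # transformer decoder layers
    hidden_size: int = 8192  # model (embedding) dimension
    n_heads: int = 64        # query attention heads per layer
    n_kv_heads: int = 8      # grouped-query attention: shared key/value heads (assumed, per Llama-2-70B)
    vocab_size: int = 32_000  # SentencePiece subword vocabulary

    def head_dim(self) -> int:
        # per-head dimension implied by the quoted sizes
        return self.hidden_size // self.n_heads

cfg = MeditronConfig()
print(cfg.head_dim())  # 128
```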
Scalability is achieved using the Megatron-LM framework, which combines pipeline, tensor, and data parallelism across 128 Nvidia A100 80GB GPUs (16 nodes, 8 GPUs per node), supported by AMD EPYC 7543 CPUs and 512GB of RAM per node. Mixed-precision (bfloat16) training and activation recomputation are used for memory efficiency. The model achieves a training throughput of 40,200 tokens/s at bfloat16 (Chen et al., 2023).
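The quoted throughput implies an aggregate compute rate under the standard ~6N training-FLOPs-per-token approximation. The arithmetic below is an illustrative back-of-the-envelope estimate, not the paper's own accounting:

```python
# Back-of-the-envelope compute estimate from the reported throughput, using the
# common ~6 * N FLOPs-per-token approximation for transformer training.
n_params = 70e9        # ~70B parameters
tokens_per_s = 40_200  # reported bfloat16 throughput across the cluster
n_gpus = 128

flops_per_s = 6 * n_params * tokens_per_s  # aggregate: ~1.69e16 FLOPs/s
flops_per_gpu = flops_per_s / n_gpus       # per A100: ~1.3e14 FLOPs/s
print(f"{flops_per_s:.3g} FLOPs/s total, {flops_per_gpu:.3g} per GPU")
```

At roughly 132 TFLOPs/s per GPU against an A100 bf16 peak of ~312 TFLOPs/s, this corresponds to a plausible ~40% model-FLOPs utilization for Megatron-LM at this scale.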
2. Data Curation and Pretraining Procedure
Meditron-70B underwent continued pretraining using a four-way “GAP-Replay” data mixture totaling 48.1 billion tokens:
- Clinical guidelines (107M tokens): 41K documents from agencies such as WHO, CDC, and NICE, filtered and cleaned.
- PubMed abstracts (5.48B tokens): 15.7M records, deduplicated and linguistically filtered.
- PubMed full-text papers (40.7B tokens): 4.9M articles from S2ORC/PMC with markup for references, figures, tables, and formulas standardized.
- Experience replay (420M tokens): A 1% mixture from RedPajama, Wikipedia, and StackExchange, used to mitigate catastrophic forgetting of general-domain knowledge.
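The token counts above imply the following mixture proportions. This is illustrative arithmetic derived from the quoted counts; the paper's actual sampling schedule may weight sources differently:

```python
# Mixture proportions implied by the GAP-Replay token counts (in billions).
tokens_b = {
    "guidelines": 0.107,
    "pubmed_abstracts": 5.48,
    "pubmed_fulltext": 40.7,
    "replay": 0.42,
}
total = sum(tokens_b.values())
weights = {k: round(v / total, 3) for k, v in tokens_b.items()}
print(weights)  # full-text PubMed papers dominate the mixture (~87%)
```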
Pretraining was executed for 23,000 iterations, processing approximately 31B tokens (~42,500 GPU hours), resulting in total estimated carbon emissions of 486 kg CO₂ (Chen et al., 2023).
The pretraining objective is standard next-token prediction (cross-entropy loss). AdamW optimization is used with weight decay 0.1 and gradient clipping at 1.0, together with a cosine learning-rate schedule that includes a 2,000-step warmup. The SentencePiece tokenizer, with a vocabulary of 32k subword units and special biomedical tokens, ensures robust handling of domain-specific nomenclature (Chen et al., 2023).
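The cosine schedule with linear warmup can be sketched as follows. Because the text does not state the peak learning rate, it is normalized to 1.0 here as a placeholder, and the 10%-of-peak floor is an assumption, not a value from the paper:

```python
import math

# Sketch of a cosine learning-rate schedule with linear warmup, using the
# 2,000-step warmup and 23,000 total iterations quoted in the text.
# peak_lr=1.0 is a placeholder (the exact value is not given); min_lr is assumed.
def cosine_lr(step, total_steps=23_000, warmup=2_000, peak_lr=1.0, min_lr=0.1):
    if step < warmup:
        return peak_lr * step / warmup  # linear warmup from 0 to peak
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(1_000), cosine_lr(2_000), cosine_lr(23_000))  # 0.5 1.0 0.1
```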
3. Methodological Evaluation on Medical Benchmarks
Meditron-70B’s QA capabilities were assessed on four major medical benchmarks: MedQA (USMLE-style, 4-option), MedMCQA (Indian medical exams), PubMedQA (yes/no/maybe answers from abstract context), and the MMLU-Medical subset (nine medical subfields). Evaluations include few-shot prompting (3-shot for the 7B model, 5-shot for the 70B model) as well as supervised fine-tuning (3 epochs, batch size 64, ChatML format), with inference performed via top-token selection, zero-shot chain-of-thought, and self-consistency strategies (Chen et al., 2023).
| Model | MedQA-4opt | PubMedQA | MedMCQA | MMLU-Medical | Avg |
|---|---|---|---|---|---|
| Llama-2-70B | 58.4% | 72.8% | 52.4% | 71.3% | 60.8% |
| Meditron-70B | 59.8% | 79.8% | 53.3% | 71.5% | 63.3% |
Meditron-70B demonstrates an average 2.5-point improvement over the original Llama-2-70B, with the strongest absolute gains on PubMedQA. Under supervised fine-tuning and self-consistency chain-of-thought, Meditron-70B achieves an average benchmark performance of 72.0%, outperforming Llama-2-70B (69.2%) and exceeding OpenAI’s GPT-3.5-175B on all tasks, while trailing Med-PaLM-2 and GPT-4 by ≤10 percentage points (Chen et al., 2023).
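The self-consistency strategy referenced above can be sketched as sampling several chain-of-thought completions and majority-voting over their extracted final answers. The model call below is a stub and all names are illustrative:

```python
from collections import Counter

# Self-consistency decoding sketch: draw several stochastic samples and take a
# majority vote over the final answers. `sample_answer` stands in for a model
# call that returns one extracted answer per invocation.
def self_consistency(sample_answer, n_samples=5):
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

draws = iter(["B", "A", "B", "B", "C"])  # stubbed per-sample answers
ans = self_consistency(lambda: next(draws))
print(ans)  # B
```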
4. Neurological Clinical Reasoning: Specialized Assessment
Meditron-70B was evaluated on a benchmark derived from the Israeli Board Certification Exams in Neurology (305 MCQs across 13 subspecialties) as well as the MedQA neurological subset (155 questions). Complexity dimensions include Factual Knowledge Depth (FKD), Clinical Concept Integration (CCI), and Reasoning Complexity (RC), each graded on a three-level scale.
On the board exam, Meditron-70B achieves 52.9% base accuracy (F1=0.692), the lowest among all 70B models evaluated. For reference, OpenAI o1 attains 90.9%, LLaMA 3.3-70B 69.5%, and OpenBioLLM-70B 65.9%. Meditron-70B’s performance further degrades under retrieval-augmented generation (RAG) to 41.2% (p=0.004), suggesting incompatibility between its internal representations and external textbook inputs. No multi-agent (agentic) approach was applied to Meditron-70B in these experiments (Sorka et al., 10 Aug 2025).
| Model | Base Accuracy | RAG Accuracy | Agent Accuracy |
|---|---|---|---|
| OpenAI o1 | 90.9% | 92.2% | 94.6% |
| GPT-4o | 80.5% | 87.3% | 89.3% |
| LLaMA 3.3-70B | 69.5% | 73.4% | 89.2% |
| OpenBioLLM-70B | 65.9% | 68.8% | – |
| Meditron-70B | 52.9% | 41.2% | – |
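The RAG condition compared in the table amounts to prepending retrieved textbook passages to the question before the model call. A minimal sketch, with retrieval stubbed and all names hypothetical:

```python
# Minimal retrieval-augmented prompting sketch: top-k retrieved passages are
# numbered and prepended as context before the question. `retrieve` stands in
# for a real retriever over a textbook corpus.
def build_rag_prompt(question, retrieve, k=3):
    passages = retrieve(question)[:k]
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt(
    "Which antibody is associated with neuromyelitis optica?",
    lambda q: [
        "Aquaporin-4 antibodies are found in most NMO patients.",
        "MOG antibodies define a distinct demyelinating syndrome.",
    ],
)
print(prompt)
```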
Performance by Board Exam complexity (base Meditron-70B):
- FKD Level 3: ≈56%
- CCI Level 3: ≈60%
- RC Level 3: ≈58%
For certain subspecialties, accuracy is markedly low (e.g., neuromuscular disorders: ~32%, neuroimmunology: ~39%, CSF disorders: ~39%, headache & dizziness: ~46%) (Sorka et al., 10 Aug 2025).
5. Empirical Findings, Error Analysis, and Model Limitations
Error analysis reveals that Meditron-70B is especially challenged by:
- Multistep temporal/probabilistic reasoning (RC Level 3)
- Integration of multiple (>3) subspecialty concepts (CCI Level 3)
- Subtle pathophysiological distinctions (e.g., differentiating paraneoplastic vs. radiation-induced syndromes)
The observed performance drop under RAG — uniquely detrimental compared to general-domain models — suggests conflicts in factual style or knowledge framing between pretrained weights and external retrievals. The absence of explicit, clinical-reasoning–oriented fine-tuning and the reliance on generic biomedical corpora likely explain weaknesses on board-exam vignettes demanding complex reasoning (Sorka et al., 10 Aug 2025).
A plausible implication is that domain-adapted pretraining alone is insufficient for specialty-level reasoning unless it is supplemented with curricula or objectives congruent with the target clinical tasks.
6. Comparative Analysis and Architectural Innovations
Unlike Meditron-70B, other models (most notably LLaMA 3.3-70B) demonstrated substantial gains under multi-agent (agentic) frameworks, which decompose clinical reasoning into discrete steps: question parsing, retrieval, synthesis, and output validation. For LLaMA 3.3-70B, this approach improved board-exam accuracy from 69.5% to 89.2%. Meditron-70B, lacking such structuring, remains bound by its pretraining distributional assumptions (Sorka et al., 10 Aug 2025).
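The four-stage agentic decomposition described above (parsing, retrieval, synthesis, output validation) can be sketched as a control-flow skeleton. In practice each stage would be its own LLM call; the stubs below illustrate only the pipeline structure, and all names are hypothetical:

```python
# Agentic decomposition sketch: parse -> retrieve -> synthesize -> validate,
# with a retry loop when the validation gate rejects a draft answer.
def agentic_answer(question, parse, retrieve, synthesize, validate, max_retries=2):
    parsed = parse(question)
    draft = None
    for _ in range(max_retries + 1):
        evidence = retrieve(parsed)
        draft = synthesize(parsed, evidence)
        if validate(draft, evidence):  # output-validation gate
            return draft
    return draft  # fall back to the last draft if validation keeps failing

result = agentic_answer(
    "Which nerve is affected in foot drop?",
    parse=lambda q: q.lower(),
    retrieve=lambda p: ["The common peroneal nerve innervates foot dorsiflexors."],
    synthesize=lambda p, ev: "Common peroneal nerve",
    validate=lambda d, ev: True,
)
print(result)  # Common peroneal nerve
```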
These findings indicate that further scaling or additional biomedical data are not by themselves sufficient. Instead, inference-time interventions such as agentic decomposition — structuring model inference into modular cognitive stages — appear necessary for domain-specialized LLMs to approach human-level performance in complex medical reasoning tasks.
7. Open Release, Licensing, and Prospects for Enhancement
Meditron-70B’s code, weights, and pretraining scripts are fully available under permissive open-source licensing (Apache 2.0 for code, CC BY-NC for weights), with the curation pipeline and a large fraction of the guidelines dataset public on HuggingFace and GitHub. This release supports independent benchmarking, auditing, and downstream task adaptation by academic and clinical research communities (Chen et al., 2023).
Recommended directions for enhancing Meditron-70B’s clinical reasoning include:
- Incorporating fine-tuning tasks targeted to multi-step clinical case resolution and simulation of clinical dialogue
- Aligning the retrieval corpus with actual exam vignette styles to reduce representational conflict
- Embedding structured agentic inference (consistent with cognitive processes)
- Incorporating multimodal data (e.g., imaging, diagrams) to extend beyond text-based evaluations
In summary, Meditron-70B defines a leading open baseline for medical-domain LLMs. Its architectural and data-curation advances enable competitive general medical QA, yet its results underscore the enduring challenge of achieving specialty-level clinical reasoning without explicit reasoning-oriented training or agentic inference structure (Chen et al., 2023, Sorka et al., 10 Aug 2025).