PMC-LLaMA: Open-Source Medical LLM
- PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture, released as a 13B-parameter model optimized for advanced biomedical QA.
- It employs a two-stage domain adaptation process—knowledge injection from vast biomedical texts and instruction tuning with rationales, conversations, and knowledge graphs.
- The model achieves significant improvements on benchmarks like MedQA and PubMedQA, outperforming existing open-source models and even ChatGPT in specific tasks.
PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture and adapted through data-centric strategies for high performance on medical natural language understanding and question answering tasks. By integrating large-scale biomedical literature, textbooks, and specialized instruction datasets, PMC-LLaMA sets a new standard for accuracy on several public medical QA benchmarks at a lightweight 13 billion parameters.
1. Model Architecture and Parameterization
PMC-LLaMA is based directly on Meta's LLaMA Transformer, utilizing the decoder-only, autoregressive language modeling paradigm. Its foundational configuration includes:
- No modifications to the Transformer block: The multi-head self-attention mechanisms, feed-forward networks, and normalization layers exactly mirror the original LLaMA design.
- Parameter scales for ablation: Experiments were conducted at 7B and 13B parameter scales, with the 13B checkpoint constituting the final release.
- Attention and position encoding: The model uses standard scaled dot-product attention and rotary positional embeddings as in LLaMA.
All architectural hyperparameters, including the number of layers, hidden size, and attention heads, are retained from the public LLaMA-13B specification. This consistency allows direct attribution of improvements to the adaptation and tuning stages rather than to architectural changes.
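As a quick check, the inherited hyperparameters can be read directly from a public LLaMA-13B configuration via HuggingFace Transformers. This is a minimal sketch, not code from the PMC-LLaMA repository, and the hub ID shown is an assumption; substitute whichever LLaMA-13B config or weights you have access to.

```python
# Sketch: inspect the LLaMA-13B hyperparameters that PMC-LLaMA inherits unchanged.
# "huggyllama/llama-13b" is an assumed hub ID, not part of the PMC-LLaMA release.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("huggyllama/llama-13b")
print(cfg.num_hidden_layers)        # decoder layers, unmodified from LLaMA
print(cfg.hidden_size)              # hidden dimension
print(cfg.num_attention_heads)      # attention heads (scaled dot-product attention)
print(cfg.max_position_embeddings)  # context length, handled with rotary embeddings
```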
2. Domain Adaptation Workflow
PMC-LLaMA employs a two-stage domain adaptation pipeline: knowledge injection followed by instruction tuning.
2.1 Knowledge Injection (MedC-K)
- Biomedical corpus: 4.8M papers from S2ORC (filtered for PMC-ID), totaling approximately 75B tokens.
- Textbooks: 30K medical textbooks (from open libraries, university holdings, publishers), comprising roughly 4B tokens.
- General data: RedPajama-Data is interleaved at a batch ratio of 1 (general) : 4 (papers) : 15 (books) to prevent catastrophic forgetting (see the sampling sketch after this list).
- Preprocessing: Uniform PDF-to-text conversion, de-duplication, and removal of non-informative text artifacts.
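The sampling sketch below illustrates the 1 : 4 : 15 interleaving, assuming the three corpora are available as lists of tokenized sequences. It is a hypothetical illustration, not the released curation code; the function and argument names are invented.

```python
import itertools
import random

def mixed_batches(general, papers, books, batch_size=20, seed=0):
    """Yield training batches mixing the corpora at a 1:4:15 ratio.

    Each argument is a list of tokenized sequences; shorter corpora are
    cycled so the ratio stays fixed throughout training.
    """
    assert batch_size % 20 == 0, "1 + 4 + 15 = 20 sequences per ratio unit"
    rng = random.Random(seed)
    sources = [(itertools.cycle(general), 1),   # RedPajama (general)
               (itertools.cycle(papers), 4),    # biomedical papers
               (itertools.cycle(books), 15)]    # medical textbooks
    while True:
        batch = []
        for src, share in sources:
            batch.extend(next(src) for _ in range(share * batch_size // 20))
        rng.shuffle(batch)
        yield batch
```

Keeping a steady stream of general-domain text in every batch is what counters catastrophic forgetting during knowledge injection.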
The loss for this knowledge injection (KI) stage is the standard autoregressive objective:

$$\mathcal{L}_{\text{KI}} = -\sum_{i} \log p_{\Theta}(x_i \mid x_{<i}),$$

where $x = \{x_1, \dots, x_N\}$ is the token sequence.
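In code, this is the usual next-token cross-entropy. The snippet below is an explicit PyTorch sketch for clarity; with HuggingFace causal LM classes, passing `labels=input_ids` computes the same shifted loss internally.

```python
import torch
import torch.nn.functional as F

def knowledge_injection_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Autoregressive LM loss: predict token i from all tokens before it.

    logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len).
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..N-1
    shift_labels = input_ids[:, 1:]    # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```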
2.2 Instruction Tuning (MedC-I)
The instruction tuning set (202M tokens) contains:
- Medical conversations: 70M tokens from Med-Alpaca, Chat-Doctor, and paraphrased dialogue (GPT-4 paraphrase prompt).
- Rationale-driven QA: 100M tokens, spanning USMLE, PubMedQA, and MedMCQA, with ChatGPT-generated rationales using both "general-style" and "option-wise" prompts.
- Knowledge-graph prompts: 32M tokens derived from UMLS entity and relation querying.
Formatting follows a strict `[INST] <instruction tokens> [/INST] <response tokens>` convention. The instruction-tuning loss is computed only over response tokens:

$$\mathcal{L}_{\text{IT}} = -\sum_{i \in \mathcal{R}} \log p_{\Theta}(x_i \mid x_{<i}, \mathcal{I}),$$

where $\mathcal{R}$ is the response token subset and $\mathcal{I}$ is the instruction.
Quality assurance procedures include de-duplication, balanced sampling from each component, and randomized shuffling.
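A minimal sketch of how such an example can be assembled so that only response tokens carry loss, assuming a HuggingFace-style tokenizer and the convention that labels set to -100 are ignored by the cross-entropy. The helper name and prompt spacing are illustrative assumptions, not the released preprocessing code.

```python
def build_instruction_example(tokenizer, instruction: str, response: str) -> dict:
    """Tokenize an [INST] ... [/INST] example, masking the instruction from the loss."""
    prompt_ids = tokenizer(f"[INST] {instruction} [/INST] ",
                           add_special_tokens=False).input_ids
    response_ids = tokenizer(response, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + response_ids,
        # -100 is ignored by PyTorch cross-entropy, so the loss sums over response tokens only.
        "labels": [-100] * len(prompt_ids) + response_ids,
    }
```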
3. Training Regimens and Computational Framework
3.1 Knowledge Injection Training
- Input length: 2,048 tokens per sequence
- Batch size: Equivalent to 3,200 tokenized sequences per step
- Optimizer: AdamW (matched to LLaMA)
- Learning rate: constant schedule
- Floating point: bf16 mixed precision
- Distributed training: FSDP with gradient checkpointing across 32 NVIDIA A100 GPUs (an illustrative configuration sketch follows this list)
- Epochs: 5 (an epoch corresponds to one pass through all textbook tokens)
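The settings above can be approximated with a HuggingFace `TrainingArguments` configuration such as the sketch below. The learning rate and the per-device batch/accumulation split across 32 GPUs are illustrative assumptions; consult the released training scripts for the exact values.

```python
from transformers import TrainingArguments

ki_args = TrainingArguments(
    output_dir="pmc-llama-ki",
    num_train_epochs=5,
    per_device_train_batch_size=4,   # assumption: 32 GPUs x 4 x 25 accum = 3,200 sequences/step
    gradient_accumulation_steps=25,
    learning_rate=2e-5,              # placeholder value; constant schedule as stated above
    lr_scheduler_type="constant",
    optim="adamw_torch",             # AdamW, matched to LLaMA
    bf16=True,                       # bf16 mixed precision
    gradient_checkpointing=True,
    fsdp="full_shard",               # fully sharded data parallel
)
```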
3.2 Instruction Tuning
- Input length: 2,048 tokens
- Batch size: 256
- Same optimizer and learning rate
- bf16 precision
- Epochs: 3 (202M total tokens per epoch)
- Compute: 8 NVIDIA A100 GPUs
4. Ablation Studies and Performance on Medical QA
Three public QA benchmarks were used: MedQA (USMLE-derived, 4 choices), MedMCQA (medical entrance exams), and PubMedQA (yes/no/maybe QA). The table below summarizes key ablations:
| Method | Size | MedQA | MedMCQA | PubMedQA |
|---|---|---|---|---|
| Baseline LLaMA | 7B | 44.54% | 48.51% | 73.40% |
| Baseline LLaMA | 13B | 45.48% | 51.42% | 76.40% |
| PMC-LLaMAₖ (papers only) | 7B | 44.70% | 50.54% | 69.50% |
| PMC-LLaMAₖ (papers + books) | 7B | 45.56% | 51.45% | 74.60% |
| PMC-LLaMAₖ | 13B | 48.15% | 54.15% | 77.10% |
| PMC-LLaMA (+ rationale only) | 13B | 49.32% | 54.56% | 77.20% |
| PMC-LLaMA (+ rationale + conversation) | 13B | 54.43% | 55.77% | 77.00% |
| PMC-LLaMA (full: rationale+conv+KG) | 13B | 56.36% | 56.04% | 77.90% |
- Model scale (13B) provides consistent gains over 7B.
- Integration of textbooks with papers is superior to papers alone.
- Each component of instruction tuning (rationales, conversations, knowledge graph) yields an additional 1–5 percentage-point increase in QA accuracy.
5. Benchmarking Against State-of-the-Art
The final PMC-LLaMA model was evaluated in a zero-shot instruction-following setting against existing models and human baselines:
| Method | Model size | MedQA | MedMCQA | PubMedQA | Avg. |
|---|---|---|---|---|---|
| Human (pass) | – | 50.0% | – | 60.0% | – |
| Human (expert) | – | 87.0% | 90.0% | 78.0% | 85.0% |
| ChatGPT | 175B | 57.0% | 44.0% | 63.9% | 54.97% |
| LLaMA-2 | 13B | 42.7% | 37.4% | 68.0% | 49.4% |
| LLaMA-2 | 70B | 43.7% | 35.0% | 74.3% | 51.0% |
| Med-Alpaca | 13B | 30.9% | 31.1% | 53.2% | 38.4% |
| Chat-Doctor | 7B | 33.9% | 31.1% | 54.3% | 39.8% |
| PMC-LLaMA | 13B | 56.4% | 56.0% | 77.9% | 64.4% |
PMC-LLaMA surpasses all open-source models by a wide margin and outperforms ChatGPT by approximately 9.4 percentage points on average, despite using roughly 1/13th as many parameters.
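For context, zero-shot evaluation reduces to formatting each benchmark question as an instruction and reading an option letter back from the generation. The sketch below shows one plausible way to do this; the exact prompt wording and answer parsing used in the paper may differ.

```python
def format_mcq_prompt(question: str, options: dict) -> str:
    """Wrap a multiple-choice question in the [INST] format used for tuning.

    options: mapping like {"A": "...", "B": "...", "C": "...", "D": "..."}.
    """
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return (
        "[INST] Answer the following multiple-choice question with a single option letter.\n"
        f"Question: {question}\nOptions:\n{option_lines} [/INST] "
    )

def parse_choice(generation: str, options: dict):
    """Return the first option letter that appears in the model output, else None."""
    for ch in generation.upper():
        if ch in options:
            return ch
    return None
```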
6. Public Release and Reproducibility
All models, datasets, and codebases for PMC-LLaMA are open-sourced under Apache 2.0 at https://github.com/chaoyi-wu/PMC-LLaMA. The repository includes:
- Pre-trained `PMC-LLaMA-13B` checkpoint (bf16 precision)
- End-to-end training scripts (PyTorch + HuggingFace)
- Dataset curation and download tools (MedC-K and MedC-I)
- Inference examples (zero/few-shot QA)
- Model card, license, and citation guidelines
These resources enable direct evaluation, extension, and further development by the research community.
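A minimal inference sketch, assuming the released 13B checkpoint is mirrored on the HuggingFace Hub under the ID shown; check the repository's model card for the authoritative path and prompt format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "axiong/PMC_LLaMA_13B"  # assumed hub ID; see the GitHub repository
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "[INST] What class of drug is metformin? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```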
7. Limitations and Prospective Directions
- Resource intensity: Pretraining required 32 × A100 GPUs for 5 epochs; instruction tuning used 8 × A100 for 3 epochs.
- Conversational coverage: Free-form, open-domain conversation remains weaker than that of models like ChatGPT.
- Domain scope: The focus is on US-style multiple-choice QA; real-world clinical tasks (e.g., EMR, long-form notes, multimodal reasoning) remain untested.
- Future work: Incorporation of clinical records and imaging data; evaluation on open-ended tasks; exploration of parameter-efficient fine-tuning strategies (e.g., LoRA, adapters) to mitigate compute demands.
While PMC-LLaMA represents a state-of-the-art open-source medical LLM within its evaluation scope, broader clinical applicability and training efficiency remain open research directions.