PMC-LLaMA: Open-Source Medical LLM
- PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture, released as a 13B-parameter model optimized for advanced biomedical QA.
- It employs a two-stage domain adaptation process—knowledge injection from vast biomedical texts and instruction tuning with rationales, conversations, and knowledge graphs.
- The model achieves significant improvements on benchmarks like MedQA and PubMedQA, outperforming existing open-source models and even ChatGPT in specific tasks.
PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture and adapted through data-centric strategies for high performance on medical natural language understanding and question answering tasks. By integrating large-scale biomedical literature, textbooks, and specialized instruction datasets, PMC-LLaMA sets a new standard for accuracy on several public medical QA benchmarks at a lightweight 13 billion parameters.
1. Model Architecture and Parameterization
PMC-LLaMA is based directly on Meta's LLaMA Transformer, utilizing the decoder-only, autoregressive language modeling paradigm. Its foundational configuration includes:
- No modifications to the Transformer block: The multi-head self-attention mechanisms, feed-forward networks, and normalization layers exactly mirror the original LLaMA design.
- Parameter scales for ablation: Experiments were conducted at 7B and 13B parameter scales, with the 13B checkpoint constituting the final release.
- Attention and position encoding: The model uses standard scaled dot-product attention and rotary positional embeddings as in LLaMA.
All architectural hyperparameters, including the number of layers, hidden size, and attention heads, are retained from the public LLaMA-13B specification. This consistency allows direct attribution of improvements to the adaptation and tuning stages rather than to architectural changes.
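As a quick check, the inherited hyperparameters can be read directly from a public LLaMA-13B configuration via HuggingFace Transformers. This is a minimal sketch, not code from the PMC-LLaMA repository, and the hub ID shown is an assumption; substitute whichever LLaMA-13B config or weights you have access to.

```python
# Sketch: inspect the LLaMA-13B hyperparameters that PMC-LLaMA inherits unchanged.
# "huggyllama/llama-13b" is an assumed hub ID, not part of the PMC-LLaMA release.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("huggyllama/llama-13b")
print(cfg.num_hidden_layers)        # decoder layers, unmodified from LLaMA
print(cfg.hidden_size)              # hidden dimension
print(cfg.num_attention_heads)      # attention heads (scaled dot-product attention)
print(cfg.max_position_embeddings)  # context length, handled with rotary embeddings
```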
2. Domain Adaptation Workflow
PMC-LLaMA employs a two-stage domain adaptation pipeline: knowledge injection followed by instruction tuning.
2.1 Knowledge Injection (MedC-K)
- Biomedical corpus: 4.8M papers from S2ORC (filtered for PMC-ID), totaling approximately 75B tokens.
- Textbooks: 30K medical textbooks (from open libraries, university holdings, publishers), comprising roughly 4B tokens.
- General data: RedPajama-Data is interleaved at a batch ratio of 1 (general) : 4 (papers) : 15 (books) to prevent catastrophic forgetting (see the sampling sketch after this list).
- Preprocessing: Uniform PDF-to-text conversion, de-duplication, and removal of non-informative text artifacts.
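The sampling sketch below illustrates the 1 : 4 : 15 interleaving, assuming the three corpora are available as lists of tokenized sequences. It is a hypothetical illustration, not the released curation code; the function and argument names are invented.

```python
import itertools
import random

def mixed_batches(general, papers, books, batch_size=20, seed=0):
    """Yield training batches mixing the corpora at a 1:4:15 ratio.

    Each argument is a list of tokenized sequences; shorter corpora are
    cycled so the ratio stays fixed throughout training.
    """
    assert batch_size % 20 == 0, "1 + 4 + 15 = 20 sequences per ratio unit"
    rng = random.Random(seed)
    sources = [(itertools.cycle(general), 1),   # RedPajama (general)
               (itertools.cycle(papers), 4),    # biomedical papers
               (itertools.cycle(books), 15)]    # medical textbooks
    while True:
        batch = []
        for src, share in sources:
            batch.extend(next(src) for _ in range(share * batch_size // 20))
        rng.shuffle(batch)
        yield batch
```

Keeping a steady stream of general-domain text in every batch is what counters catastrophic forgetting during knowledge injection.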
The loss for this knowledge injection (KI) stage is the standard autoregressive objective:

$$\mathcal{L}_{\text{KI}} = -\sum_{i} \log p_{\Theta}(x_i \mid x_{<i}),$$

where $x = \{x_1, \dots, x_N\}$ is the token sequence.
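In code, this is the usual next-token cross-entropy. The snippet below is an explicit PyTorch sketch for clarity; with HuggingFace causal LM classes, passing `labels=input_ids` computes the same shifted loss internally.

```python
import torch
import torch.nn.functional as F

def knowledge_injection_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Autoregressive LM loss: predict token i from all tokens before it.

    logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len).
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..N-1
    shift_labels = input_ids[:, 1:]    # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```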
2.2 Instruction Tuning (MedC-I)
The instruction tuning set (202M tokens) contains:
- Medical conversations: 70M tokens from Med-Alpaca, Chat-Doctor, and paraphrased dialogue (GPT-4 paraphrase prompt).
- Rationale-driven QA: 100M tokens, spanning USMLE, PubMedQA, and MedMCQA, with ChatGPT-generated rationales using both "general-style" and "option-wise" prompts.
- Knowledge-graph prompts: 32M tokens derived from UMLS entity and relation querying.
Formatting follows a strict `[INST] <instruction tokens> [/INST] <response tokens>` convention. The instruction-tuning loss is computed only over response tokens:

$$\mathcal{L}_{\text{IT}} = -\sum_{i \in \mathcal{R}} \log p_{\Theta}(x_i \mid x_{<i}, \mathcal{I}),$$

where $\mathcal{R}$ is the response token subset and $\mathcal{I}$ is the instruction.
Quality assurance procedures include de-duplication, balanced sampling from each component, and randomized shuffling.
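A minimal sketch of how such an example can be assembled so that only response tokens carry loss, assuming a HuggingFace-style tokenizer and the convention that labels set to -100 are ignored by the cross-entropy. The helper name and prompt spacing are illustrative assumptions, not the released preprocessing code.

```python
def build_instruction_example(tokenizer, instruction: str, response: str) -> dict:
    """Tokenize an [INST] ... [/INST] example, masking the instruction from the loss."""
    prompt_ids = tokenizer(f"[INST] {instruction} [/INST] ",
                           add_special_tokens=False).input_ids
    response_ids = tokenizer(response, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + response_ids,
        # -100 is ignored by PyTorch cross-entropy, so the loss sums over response tokens only.
        "labels": [-100] * len(prompt_ids) + response_ids,
    }
```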
3. Training Regimens and Computational Framework
3.1 Knowledge Injection Training
- Input length: 2,048 tokens per sequence
- Batch size: Equivalent to 3,200 tokenized sequences per step
- Optimizer: AdamW (matched to LLaMA)
- Learning rate: constant schedule
- Floating point: bf16 mixed precision
- Distributed training: FSDP with gradient checkpointing across 32 NVIDIA A100 GPUs (an illustrative configuration sketch follows this list)
- Epochs: 5 (an epoch corresponds to one pass through all textbook tokens)
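The settings above can be approximated with a HuggingFace `TrainingArguments` configuration such as the sketch below. The learning rate and the per-device batch/accumulation split across 32 GPUs are illustrative assumptions; consult the released training scripts for the exact values.

```python
from transformers import TrainingArguments

ki_args = TrainingArguments(
    output_dir="pmc-llama-ki",
    num_train_epochs=5,
    per_device_train_batch_size=4,   # assumption: 32 GPUs x 4 x 25 accum = 3,200 sequences/step
    gradient_accumulation_steps=25,
    learning_rate=2e-5,              # placeholder value; constant schedule as stated above
    lr_scheduler_type="constant",
    optim="adamw_torch",             # AdamW, matched to LLaMA
    bf16=True,                       # bf16 mixed precision
    gradient_checkpointing=True,
    fsdp="full_shard",               # fully sharded data parallel
)
```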
3.2 Instruction Tuning
- Input length: 2,048 tokens
- Batch size: 256
- Same optimizer and learning rate
- bf16 precision
- Epochs: 3 (202M total tokens per epoch)
- Compute: 8 NVIDIA A100 GPUs
4. Ablation Studies and Performance on Medical QA
Three public QA benchmarks were used: MedQA (USMLE-derived, 4 choices), MedMCQA (medical entrance exams), and PubMedQA (yes/no/maybe QA). The table below summarizes key ablations:
| Method | Size | MedQA | MedMCQA | PubMedQA |
|---|---|---|---|---|
| Baseline LLaMA | 7B | 44.54% | 48.51% | 73.40% |
| Baseline LLaMA | 13B | 45.48% | 51.42% | 76.40% |
| PMC-LLaMAₖ (papers only) | 7B | 44.70% | 50.54% | 69.50% |
| PMC-LLaMAₖ (papers + books) | 7B | 45.56% | 51.45% | 74.60% |
| PMC-LLaMAₖ | 13B | 48.15% | 54.15% | 77.10% |
| PMC-LLaMA (+ rationale only) | 13B | 49.32% | 54.56% | 77.20% |
| PMC-LLaMA (+ rationale + conversation) | 13B | 54.43% | 55.77% | 77.00% |
| PMC-LLaMA (full: rationale+conv+KG) | 13B | 56.36% | 56.04% | 77.90% |
- Model scale (13B) provides consistent gains over 7B.
- Integration of textbooks with papers is superior to papers alone.
- Each component of instruction tuning (rationales, conversations, knowledge graph) yields an additional 1–5 percentage-point increase in QA accuracy.
5. Benchmarking Against State-of-the-Art
The final PMC-LLaMA model was evaluated in a zero-shot instruction-following setting against existing models and human baselines:
| Method | Model size | MedQA | MedMCQA | PubMedQA | Avg. |
|---|---|---|---|---|---|
| Human (pass) | – | 50.0% | – | 60.0% | – |
| Human (expert) | – | 87.0% | 90.0% | 78.0% | 85.0% |
| ChatGPT | 175B | 57.0% | 44.0% | 63.9% | 54.97% |
| LLaMA-2 | 13B | 42.7% | 37.4% | 68.0% | 49.4% |
| LLaMA-2 | 70B | 43.7% | 35.0% | 74.3% | 51.0% |
| Med-Alpaca | 13B | 30.9% | 31.1% | 53.2% | 38.4% |
| Chat-Doctor | 7B | 33.9% | 31.1% | 54.3% | 39.8% |
| PMC-LLaMA | 13B | 56.4% | 56.0% | 77.9% | 64.4% |
PMC-LLaMA surpasses all open-source models by a wide margin and outperforms ChatGPT by approximately 9.4 percentage points on average, despite using roughly 1/13th as many parameters.
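For context, zero-shot evaluation reduces to formatting each benchmark question as an instruction and reading an option letter back from the generation. The sketch below shows one plausible way to do this; the exact prompt wording and answer parsing used in the paper may differ.

```python
def format_mcq_prompt(question: str, options: dict) -> str:
    """Wrap a multiple-choice question in the [INST] format used for tuning.

    options: mapping like {"A": "...", "B": "...", "C": "...", "D": "..."}.
    """
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return (
        "[INST] Answer the following multiple-choice question with a single option letter.\n"
        f"Question: {question}\nOptions:\n{option_lines} [/INST] "
    )

def parse_choice(generation: str, options: dict):
    """Return the first option letter that appears in the model output, else None."""
    for ch in generation.upper():
        if ch in options:
            return ch
    return None
```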
6. Public Release and Reproducibility
All models, datasets, and codebases for PMC-LLaMA are open-sourced under Apache 2.0 at https://github.com/chaoyi-wu/PMC-LLaMA. The repository includes:
- Pre-trained `PMC-LLaMA-13B` checkpoint (bf16 precision)
- End-to-end training scripts (PyTorch + HuggingFace)
- Dataset curation and download tools (MedC-K and MedC-I)
- Inference examples (zero/few-shot QA)
- Model card, license, and citation guidelines
These resources enable direct evaluation, extension, and further development by the research community.
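A minimal inference sketch, assuming the released 13B checkpoint is mirrored on the HuggingFace Hub under the ID shown; check the repository's model card for the authoritative path and prompt format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "axiong/PMC_LLaMA_13B"  # assumed hub ID; see the GitHub repository
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "[INST] What class of drug is metformin? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```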
7. Limitations and Prospective Directions
- Resource intensity: Pretraining required 32 × A100 GPUs for 5 epochs; instruction tuning used 8 × A100 for 3 epochs.
- Conversational coverage: Free-form, open-domain conversation remains weaker than that of models like ChatGPT.
- Domain scope: The focus is on US-style multiple-choice QA; real-world clinical tasks (e.g., EMR, long-form notes, multimodal reasoning) remain untested.
- Future work: Incorporation of clinical records and imaging data; evaluation on open-ended tasks; exploration of parameter-efficient fine-tuning strategies (e.g., LoRA, adapters) to mitigate compute demands.
While PMC-LLaMA represents a state-of-the-art open-source medical LLM within its evaluation scope, broader clinical applicability and training efficiency remain open research directions.