
PMC-LLaMA: Open-Source Medical LLM

Updated 17 November 2025
  • PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture, optimized with a 13B parameter model for advanced biomedical QA.
  • It employs a two-stage domain adaptation process—knowledge injection from vast biomedical texts and instruction tuning with rationales, conversations, and knowledge graphs.
  • The model achieves significant improvements on benchmarks like MedQA and PubMedQA, outperforming existing open-source models and even ChatGPT in specific tasks.

PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture and adapted through data-centric strategies for high performance on medical natural language understanding and question answering tasks. By integrating large-scale biomedical literature, textbooks, and specialized instruction datasets, PMC-LLaMA achieves strong accuracy on several public medical QA benchmarks with a lightweight parameter count of 13 billion.

1. Model Architecture and Parameterization

PMC-LLaMA is based directly on Meta's LLaMA Transformer, utilizing the decoder-only, autoregressive language modeling paradigm. Its foundational configuration includes:

  • No modifications to the Transformer block: The multi-head self-attention mechanisms, feed-forward networks, and normalization layers exactly mirror the original LLaMA design.
  • Parameter scales for ablation: Experiments were conducted at 7B and 13B parameter scales, with the 13B checkpoint constituting the final release.
  • Attention and position encoding: The model uses standard scaled dot-product attention and rotary positional embeddings as in LLaMA.

All architectural hyperparameters, including the number of layers, hidden size, and attention heads, are retained from the public LLaMA-13B specification. This consistency allows direct attribution of improvements to the adaptation and tuning stages rather than to architectural changes.
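
Because the architecture is unchanged, the released checkpoint can be handled like a stock LLaMA model. The snippet below is a minimal sketch of loading such a checkpoint with HuggingFace Transformers; the model identifier is a placeholder, not the official release name.

```python
# Minimal sketch: loading a LLaMA-style checkpoint with HuggingFace Transformers.
# The model id below is a placeholder; substitute the officially released
# PMC-LLaMA-13B weights from the project repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/or/hub-id/of/PMC-LLaMA-13B"  # placeholder, not verified

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is released in bf16
    device_map="auto",
)
```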

2. Domain Adaptation Workflow

PMC-LLaMA employs a two-stage domain adaptation pipeline: knowledge injection followed by instruction tuning.

2.1 Knowledge Injection (MedC-K)

  • Biomedical corpus: 4.8M papers from S2ORC (filtered for PMC-ID), totaling approximately 75B tokens.
  • Textbooks: 30K medical textbooks (from open libraries, university holdings, publishers), comprising roughly 4B tokens.
  • General data: RedPajama-Data is interleaved at a batch ratio of 1 (general) : 4 (papers) : 15 (books) to prevent catastrophic forgetting.
  • Preprocessing: Uniform PDF-to-text conversion, de-duplication, and removal of non-informative text artifacts.

The loss for this knowledge injection (KI) stage is formalized as:

$$L_{\mathrm{KI}}(\Phi) = -\sum_{i=1}^{N} \log \Phi(u_i \mid u_{<i})$$

where $\mathcal{U} = \{u_1, \ldots, u_N\}$ is the token sequence.
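
In practice this is the standard causal language modeling objective. Below is a minimal PyTorch sketch of the loss, assuming a HuggingFace-style causal LM and a batch of token ids; it is illustrative, not the project's training code.

```python
# Illustrative sketch of the knowledge-injection (causal LM) loss.
# Assumes `model` is a HuggingFace causal LM and `input_ids` a batch of token ids.
import torch
import torch.nn.functional as F

def knowledge_injection_loss(model, input_ids, attention_mask):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Predict token u_i from u_{<i}: shift logits left, labels right.
    shift_logits = outputs.logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```

Equivalently, HuggingFace causal LMs compute this loss internally when `labels=input_ids` is passed to the forward call.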

2.2 Instruction Tuning (MedC-I)

The instruction tuning set (202M tokens) contains:

  • Medical conversations: 70M tokens from Med-Alpaca, Chat-Doctor, and paraphrased dialogue (GPT-4 paraphrase prompt).
  • Rationale-driven QA: 100M tokens, spanning USMLE, PubMedQA, and MedMCQA, with ChatGPT-generated rationales using both "general-style" and "option-wise" prompts.
  • Knowledge-graph prompts: 32M tokens derived from UMLS entity and relation querying.

Formatting follows a strict [INST] <instruction tokens> [/INST] <response tokens> convention. The instruction-tuning loss is:

$$L_{\mathrm{IT}}(\Phi) = -\sum_{u_i \in \mathcal{R}} \log \Phi(u_i \mid u_{<i}, \mathcal{I})$$

where $\mathcal{R}$ is the set of response tokens and $\mathcal{I}$ is the instruction.
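
The key difference from the knowledge-injection stage is that the loss is computed only on response tokens; instruction tokens are masked out. A minimal sketch of this masking, assuming a HuggingFace tokenizer and the [INST]/[/INST] convention above (-100 is the default ignore index of PyTorch's cross-entropy loss):

```python
# Illustrative sketch: build an [INST]-formatted example and mask the
# instruction span so the loss covers only response tokens.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(tokenizer, instruction, response):
    prompt = f"[INST] {instruction} [/INST] "
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    # Supervise only the response (and EOS); ignore the instruction tokens.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}
```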

Quality assurance procedures include de-duplication, balanced sampling from each component, and randomized shuffling.

3. Training Regimens and Computational Framework

3.1 Knowledge Injection Training

  • Input length: 2,048 tokens per sequence
  • Batch size: Equivalent to 3,200 tokenized sequences per step
  • Optimizer: AdamW (matched to LLaMA)
  • Learning rate: 2 × 10⁻⁵, constant
  • Floating point: bf16 mixed precision
  • Distributed training: FSDP with gradient checkpointing across 32 NVIDIA A100 GPUs
  • Epochs: 5 (based on single pass through all textbook tokens)
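
A minimal sketch of how the configuration above might be expressed with the HuggingFace Trainer is shown below; the exact launch scripts live in the project repository, and the paths and per-device batch sizes here are placeholders.

```python
# Illustrative training configuration mirroring the settings above
# (bf16, constant LR of 2e-5, gradient checkpointing, FSDP sharding).
# Paths and per-device batch sizes are placeholders, not the official scripts.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./pmc-llama-ki",       # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=4,     # chosen so the global batch ≈ 3,200 sequences
    gradient_accumulation_steps=25,    # 32 GPUs x 4 x 25 = 3,200
    learning_rate=2e-5,
    lr_scheduler_type="constant",
    bf16=True,
    gradient_checkpointing=True,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
    logging_steps=10,
    save_strategy="epoch",
)
```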

3.2 Instruction Tuning

  • Input length: 2,048 tokens
  • Batch size: 256
  • Same optimizer and learning rate
  • bf16 precision
  • Epochs: 3 (202M total tokens per epoch)
  • Compute: 8 NVIDIA A100 GPUs

4. Ablation Studies and Performance on Medical QA

Three public QA benchmarks were used: MedQA (USMLE-derived, 4 choices), MedMCQA (medical entrance exams), and PubMedQA (yes/no/maybe QA). The table below summarizes key ablations:

| Method | Size | MedQA | MedMCQA | PubMedQA |
|---|---|---|---|---|
| Baseline LLaMA | 7B | 44.54% | 48.51% | 73.40% |
| Baseline LLaMA | 13B | 45.48% | 51.42% | 76.40% |
| PMC-LLaMAₖ (papers only) | 7B | 44.70% | 50.54% | 69.50% |
| PMC-LLaMAₖ (papers + books) | 7B | 45.56% | 51.45% | 74.60% |
| PMC-LLaMAₖ | 13B | 48.15% | 54.15% | 77.10% |
| PMC-LLaMA (+ rationale only) | 13B | 49.32% | 54.56% | 77.20% |
| PMC-LLaMA (+ rationale + conversation) | 13B | 54.43% | 55.77% | 77.00% |
| PMC-LLaMA (full: rationale + conv + KG) | 13B | 56.36% | 56.04% | 77.90% |

  • Model scale (13B) provides consistent gains over 7B.
  • Integration of textbooks with papers is superior to papers alone.
  • Each component of instruction tuning (rationales, conversations, knowledge-graph prompts) contributes an additional 1–5 percentage points of QA accuracy.
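
Accuracy on these benchmarks is typically computed by extracting the predicted option from the model's output and comparing it to the gold answer. A minimal scoring sketch is shown below; the prediction format is a simplifying assumption, not the official evaluation harness.

```python
# Illustrative multiple-choice scoring: extract the first option letter
# (A-D for MedQA/MedMCQA) from each generation and compute accuracy.
import re

def extract_choice(generation: str, options: str = "ABCD") -> str | None:
    match = re.search(rf"\b([{options}])\b", generation)
    return match.group(1) if match else None

def accuracy(generations: list[str], gold: list[str]) -> float:
    correct = sum(extract_choice(g) == a for g, a in zip(generations, gold))
    return correct / len(gold)

# Example: accuracy(["The answer is B.", "C"], ["B", "D"]) -> 0.5
```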

5. Benchmarking Against State-of-the-Art

The final PMC-LLaMA model was evaluated in a zero-shot, instruction-following setting against existing models and human baselines:

| Method | Model size | MedQA | MedMCQA | PubMedQA | Avg. |
|---|---|---|---|---|---|
| Human (pass) | – | 50.0% | 60.0% | – | – |
| Human (expert) | – | 87.0% | 90.0% | 78.0% | 85.0% |
| ChatGPT | 175B | 57.0% | 44.0% | 63.9% | 54.97% |
| LLaMA-2 | 13B | 42.7% | 37.4% | 68.0% | 49.4% |
| LLaMA-2 | 70B | 43.7% | 35.0% | 74.3% | 51.0% |
| Med-Alpaca | 13B | 30.9% | 31.1% | 53.2% | 38.4% |
| Chat-Doctor | 7B | 33.9% | 31.1% | 54.3% | 39.8% |
| PMC-LLaMA | 13B | 56.4% | 56.0% | 77.9% | 64.4% |

PMC-LLaMA surpasses all open-source models by a wide margin and outperforms ChatGPT by approximately 9.5 percentage points on average with just 1/13th of the parameters.
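
For reference, a zero-shot query in the [INST] format described above might be issued as follows. The model identifier is a placeholder for the released 13B checkpoint, and the prompt wording and question are illustrative assumptions rather than the official evaluation template.

```python
# Illustrative zero-shot QA query in the [INST] format described above.
# Model id, prompt wording, and question are placeholders, not the official setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/or/hub-id/of/PMC-LLaMA-13B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = (
    "Which serum marker is most specific for myocardial injury?\n"
    "A) AST  B) Troponin I  C) ALT  D) Amylase"
)
prompt = f"[INST] Answer the following multiple-choice question. {question} [/INST] "

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens (the model's answer).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```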

6. Public Release and Reproducibility

All models, datasets, and codebases for PMC-LLaMA are open-sourced under Apache 2.0 at https://github.com/chaoyi-wu/PMC-LLaMA. The repository includes:

  • Pre-trained PMC-LLaMA-13B checkpoint (bf16 precision)
  • End-to-end training scripts (PyTorch + HuggingFace)
  • Dataset curation and download tools (MedC-K and MedC-I)
  • Inference examples (zero/few-shot QA)
  • Model card, license, and citation guidelines

These resources enable direct evaluation, extension, and further development by the research community.

7. Limitations and Prospective Directions

  • Resource intensity: Pretraining required 32 × A100 GPUs for 5 epochs; instruction tuning used 8 × A100 for 3 epochs.
  • Conversational coverage: The model is less capable at free-form, open-domain conversation than general-purpose assistants such as ChatGPT.
  • Domain scope: The focus is on US-style multiple-choice QA; real-world clinical tasks (e.g., EMR, long-form notes, multimodal reasoning) remain untested.
  • Future work: Incorporation of clinical records and imaging data; evaluation on open-ended tasks; exploration of parameter-efficient fine-tuning strategies (e.g., LoRA, adapters) to mitigate compute demands.

This suggests that while PMC-LLaMA represents a state-of-the-art open-source medical LLM within its evaluation scope, broader clinical applicability and efficiency remain open research directions.
