PMC-LLaMA: Open-Source Medical LLM

Updated 17 November 2025
  • PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture, released as a 13B-parameter model optimized for biomedical question answering.
  • It employs a two-stage domain adaptation process—knowledge injection from vast biomedical texts and instruction tuning with rationales, conversations, and knowledge graphs.
  • The model achieves significant improvements on benchmarks like MedQA and PubMedQA, outperforming existing open-source models and even ChatGPT in specific tasks.

PMC-LLaMA is an open-source medical LLM built on Meta's LLaMA architecture and adapted through data-centric strategies for high performance on medical natural language understanding and question answering tasks. By integrating large-scale biomedical literature, textbooks, and specialized instruction datasets, PMC-LLaMA sets a new standard for accuracy on several public medical QA benchmarks with a lightweight parameter count of 13 billion.

1. Model Architecture and Parameterization

PMC-LLaMA is based directly on Meta's LLaMA Transformer, utilizing the decoder-only, autoregressive language modeling paradigm. Its foundational configuration includes:

  • No modifications to the Transformer block: The multi-head self-attention mechanisms, feed-forward networks, and normalization layers exactly mirror the original LLaMA design.
  • Parameter scales for ablation: Experiments were conducted at 7B and 13B parameter scales, with the 13B checkpoint constituting the final release.
  • Attention and position encoding: The model uses standard scaled dot-product attention and rotary positional embeddings as in LLaMA.

All architectural hyperparameters, including the number of layers, hidden size, and attention heads, are retained from the public LLaMA-13B specification. This consistency allows direct attribution of improvements to the adaptation and tuning stages rather than to architectural changes.
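For reference, the public LLaMA-13B hyperparameters that PMC-LLaMA inherits can be expressed with the Hugging Face `LlamaConfig`. The sketch below is illustrative only, written from the standard LLaMA-13B specification (40 layers, hidden size 5,120, 40 heads); it is not a configuration file published by the PMC-LLaMA authors.

```python
from transformers import LlamaConfig

# Illustrative only: the public LLaMA-13B specification that PMC-LLaMA retains
# unchanged (decoder-only Transformer with rotary position embeddings).
config = LlamaConfig(
    hidden_size=5120,              # model width
    intermediate_size=13824,       # feed-forward (SwiGLU) width
    num_hidden_layers=40,          # Transformer decoder blocks
    num_attention_heads=40,        # scaled dot-product attention heads
    max_position_embeddings=2048,  # matches the 2,048-token training context
    vocab_size=32000,              # original LLaMA tokenizer vocabulary
)
print(config)
```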

2. Domain Adaptation Workflow

PMC-LLaMA employs a two-stage domain adaptation pipeline: knowledge injection followed by instruction tuning.

2.1 Knowledge Injection (MedC-K)

  • Biomedical corpus: 4.8M papers from S2ORC (filtered for PMC-ID), totaling approximately 75B tokens.
  • Textbooks: 30K medical textbooks (from open libraries, university holdings, publishers), comprising roughly 4B tokens.
  • General data: RedPajama-Data is interleaved at a batch ratio of 1 (general) : 4 (papers) : 15 (books) to prevent catastrophic forgetting (see the sketch after this list).
  • Preprocessing: Uniform PDF-to-text conversion, de-duplication, and removal of non-informative text artifacts.
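A minimal sketch of how such a fixed batch ratio could be realized is shown below; it assumes three pre-tokenized sequence streams and is only an illustration of the mixing idea, since the exact sampler used by the authors is not specified.

```python
import itertools
import random

def interleave(general, papers, books, ratio=(1, 4, 15), seed=0):
    """Yield sequences from three streams at a fixed per-cycle ratio.

    Illustrative only: `general`, `papers`, and `books` are assumed to be
    iterators over pre-tokenized 2,048-token sequences.
    """
    rng = random.Random(seed)
    streams = [general, papers, books]
    while True:
        # One cycle contains ratio[i] draws from stream i; shuffling mixes
        # the sources within the cycle instead of concatenating them.
        cycle = [i for i, r in enumerate(ratio) for _ in range(r)]
        rng.shuffle(cycle)
        for i in cycle:
            yield next(streams[i])

# Usage with toy stand-ins for tokenized RedPajama, paper, and book sequences:
gen = itertools.cycle(["<general>"])
pap = itertools.cycle(["<paper>"])
bok = itertools.cycle(["<book>"])
batch = list(itertools.islice(interleave(gen, pap, bok), 20))  # one 20-sequence cycle
```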

The loss for this knowledge injection (KI) stage is formalized as:

$$L_{\mathrm{KI}}(\Phi) = -\sum_{i=1}^{N} \log \Phi(u_i \mid u_{<i})$$

where $\mathcal{U} = \{u_1, \ldots, u_N\}$ is the token sequence.
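In code, this is the standard next-token (causal language modeling) objective. A minimal PyTorch sketch, assuming `logits` produced by the decoder-only model and a batch of `input_ids`:

```python
import torch
import torch.nn.functional as F

def knowledge_injection_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """L_KI: negative log-likelihood of each token given all preceding tokens.

    logits:    (batch, seq_len, vocab_size) from the decoder-only model
    input_ids: (batch, seq_len) token ids of the sequence U = {u_1, ..., u_N}
    """
    # Predict token t+1 from positions <= t: shift logits left, targets right.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```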

2.2 Instruction Tuning (MedC-I)

The instruction tuning set (202M tokens) contains:

  • Medical conversations: 70M tokens from Med-Alpaca, Chat-Doctor, and paraphrased dialogue (GPT-4 paraphrase prompt).
  • Rationale-driven QA: 100M tokens, spanning USMLE, PubMedQA, and MedMCQA, with ChatGPT-generated rationales using both "general-style" and "option-wise" prompts.
  • Knowledge-graph prompts: 32M tokens derived from UMLS entity and relation querying.

Formatting follows a strict [INST] <instruction tokens> [/INST] <response tokens> convention. The instruction-tuning loss is:

$$L_{\mathrm{IT}}(\Phi) = -\sum_{u_i \in \mathcal{R}} \log \Phi(u_i \mid u_{<i}, \mathcal{I})$$

where $\mathcal{R}$ is the response token subset and $\mathcal{I}$ is the instruction.
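Equivalently, the objective is the same next-token loss computed only on response tokens. A minimal sketch of building one training example under the `[INST] ... [/INST]` template above, assuming a Hugging Face tokenizer for the model (the label value -100 is the ignore index used by PyTorch's cross-entropy, so instruction tokens contribute no loss):

```python
import torch

IGNORE_INDEX = -100  # positions excluded from the loss (instruction tokens)

def build_instruction_example(tokenizer, instruction: str, response: str):
    """Tokenize one [INST] ... [/INST] pair and mask the instruction in the labels.

    Illustrative only: assumes `tokenizer` is a Hugging Face tokenizer for the model.
    """
    prompt = f"[INST] {instruction} [/INST] "
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + response_ids)
    return {"input_ids": input_ids, "labels": labels}
```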

Quality assurance procedures include de-duplication, balanced sampling from each component, and randomized shuffling.

3. Training Regimens and Computational Framework

3.1 Knowledge Injection Training

  • Input length: 2,048 tokens per sequence
  • Batch size: Equivalent to 3,200 tokenized sequences per step
  • Optimizer: AdamW (matched to LLaMA)
  • Learning rate: $2 \times 10^{-5}$, constant
  • Floating point: bf16 mixed precision
  • Distributed training: FSDP with gradient checkpointing across 32 NVIDIA A100 GPUs
  • Epochs: 5 (one epoch defined as a single pass through all textbook tokens)
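The hyperparameters above can be summarized in a Hugging Face `TrainingArguments`-style configuration. This is a hedged sketch rather than the authors' released script: the per-device batch size and accumulation steps are assumptions chosen so that 4 × 25 × 32 GPUs = 3,200 sequences per optimizer step, and the output path is a placeholder.

```python
from transformers import TrainingArguments

# Illustrative configuration mirroring the reported knowledge-injection setup.
args = TrainingArguments(
    output_dir="pmc_llama_ki",           # hypothetical output path
    per_device_train_batch_size=4,       # assumption; see lead-in note
    gradient_accumulation_steps=25,      # 4 * 25 * 32 GPUs = 3,200 sequences/step
    learning_rate=2e-5,
    lr_scheduler_type="constant",
    optim="adamw_torch",
    bf16=True,
    gradient_checkpointing=True,
    num_train_epochs=5,
    fsdp="full_shard auto_wrap",         # fully sharded data parallel
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
    logging_steps=10,
    save_strategy="epoch",
)
```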

3.2 Instruction Tuning

  • Input length: 2,048 tokens
  • Batch size: 256
  • Same optimizer and learning rate
  • bf16 precision
  • Epochs: 3 (202M total tokens per epoch)
  • Compute: 8 NVIDIA A100 GPUs

4. Ablation Studies and Performance on Medical QA

Three public QA benchmarks were used: MedQA (USMLE-derived, 4 choices), MedMCQA (medical entrance exams), and PubMedQA (yes/no/maybe QA). The table below summarizes key ablations:

| Method | Size | MedQA | MedMCQA | PubMedQA |
|---|---|---|---|---|
| Baseline LLaMA | 7B | 44.54% | 48.51% | 73.40% |
| Baseline LLaMA | 13B | 45.48% | 51.42% | 76.40% |
| PMC-LLaMAₖ (papers only) | 7B | 44.70% | 50.54% | 69.50% |
| PMC-LLaMAₖ (papers + books) | 7B | 45.56% | 51.45% | 74.60% |
| PMC-LLaMAₖ | 13B | 48.15% | 54.15% | 77.10% |
| PMC-LLaMA (+ rationale only) | 13B | 49.32% | 54.56% | 77.20% |
| PMC-LLaMA (+ rationale + conversation) | 13B | 54.43% | 55.77% | 77.00% |
| PMC-LLaMA (full: rationale + conv + KG) | 13B | 56.36% | 56.04% | 77.90% |

  • Model scale (13B) provides consistent gains over 7B.
  • Integration of textbooks with papers is superior to papers alone.
  • Each component of instruction tuning (rationale, conversation, knowledge graph) yields an additional gain of roughly 1–5 percentage points in QA accuracy.

5. Benchmarking Against State-of-the-Art

The final PMC-LLaMA model was evaluated in a zero-shot instruction-following setting against existing models and human baselines:

| Method | Model size | MedQA | MedMCQA | PubMedQA | Avg. |
|---|---|---|---|---|---|
| Human (pass) | – | 50.0% | – | 60.0% | – |
| Human (expert) | – | 87.0% | 90.0% | 78.0% | 85.0% |
| ChatGPT | 175B | 57.0% | 44.0% | 63.9% | 54.97% |
| LLaMA-2 | 13B | 42.7% | 37.4% | 68.0% | 49.4% |
| LLaMA-2 | 70B | 43.7% | 35.0% | 74.3% | 51.0% |
| Med-Alpaca | 13B | 30.9% | 31.1% | 53.2% | 38.4% |
| Chat-Doctor | 7B | 33.9% | 31.1% | 54.3% | 39.8% |
| PMC-LLaMA | 13B | 56.4% | 56.0% | 77.9% | 63.4% |

PMC-LLaMA surpasses all open-source models by a wide margin and outperforms ChatGPT by roughly 8.5 percentage points on average with about 1/13th of the parameters.
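A minimal zero-shot inference sketch for this evaluation setting is given below. It assumes a local copy of the released checkpoint in a Hugging Face-compatible format; the model path, question, and prompt wording are illustrative placeholders, not the authors' exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/PMC-LLaMA-13B"  # placeholder; obtain weights via the official repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)

# Toy USMLE-style question for illustration only.
question = (
    "A 23-year-old woman presents with fatigue and microcytic anemia. "
    "Which of the following is the most likely diagnosis?\n"
    "A. Iron deficiency anemia\nB. Vitamin B12 deficiency\n"
    "C. Aplastic anemia\nD. Hemolytic anemia"
)
prompt = f"[INST] Answer the following question with a single option letter.\n{question} [/INST] "

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# Decode only the newly generated tokens (the model's answer).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```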

6. Public Release and Reproducibility

All models, datasets, and codebases for PMC-LLaMA are open-sourced under Apache 2.0 at https://github.com/chaoyi-wu/PMC-LLaMA. The repository includes:

  • Pre-trained PMC-LLaMA-13B checkpoint (bf16 precision)
  • End-to-end training scripts (PyTorch + HuggingFace)
  • Dataset curation and download tools (MedC-K and MedC-I)
  • Inference examples (zero/few-shot QA)
  • Model card, license, and citation guidelines

These resources enable direct evaluation, extension, and further development by the research community.

7. Limitations and Prospective Directions

  • Resource intensity: Pretraining required 32 × A100 GPUs for 5 epochs; instruction tuning used 8 × A100 for 3 epochs.
  • Conversational coverage: The model is less capable at free-form, open-domain conversation than general-purpose assistants such as ChatGPT.
  • Domain scope: The focus is on US-style multiple-choice QA; real-world clinical tasks (e.g., EMR, long-form notes, multimodal reasoning) remain untested.
  • Future work: Incorporation of clinical records and imaging data; evaluation on open-ended tasks; exploration of parameter-efficient fine-tuning strategies (e.g., LoRA, adapters) to mitigate compute demands.
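As a concrete illustration of the parameter-efficient direction noted above, a LoRA setup with the `peft` library might look like the following. This is a hedged sketch under assumed ranks and target modules, not something evaluated in the PMC-LLaMA work, and the model path is a placeholder.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/PMC-LLaMA-13B")  # placeholder path

# Low-rank adapters on the attention projections; rank and targets are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 13B weights would train
```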

This suggests that while PMC-LLaMA represents a state-of-the-art open-source medical LLM within its evaluation scope, broader clinical applicability and efficiency remain open research directions.
