BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

Published Mar 27, 2024 in cs.CL and cs.AI


Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks. However, these models have hundreds of billions of parameters, are computationally expensive to run, require users to send their input data over the internet, and are trained on unknown data sources. Can smaller, more targeted models compete? To address this question, we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics. This demonstrates that smaller models can potentially serve as transparent, privacy-preserving, economical and environmentally friendly foundations for particular NLP applications, such as in biomedicine. The model is available on the Hugging Face Hub: https://huggingface.co/stanford-crfm/BioMedLM.
Comparison of BioMedLM's and GPT-Neo's performance on biomedical language processing tasks.


  • BioMedLM is a 2.7 billion parameter model trained on PubMed abstracts and full articles for biomedical NLP tasks, and it performs competitively with much larger models when fine-tuned.

  • Designed as a GPT-style autoregressive model, BioMedLM prioritizes efficiency and specialization by training solely on PubMed data, at a size that is practical to run on local hardware.

  • It achieves impressive results on biomedical question-answering benchmarks, outperforming or closely rivaling larger and generalist models in specific tasks.

  • BioMedLM's approach addresses key issues in healthcare NLP applications, including data privacy, cost-effectiveness, and reducing environmental impact.


In recent years, language models such as GPT-4 and Med-PaLM 2 have significantly advanced NLP across many domains, including biomedicine. However, their size, proprietary nature, and resource demands pose practical limitations, especially for applications that require data privacy, cost-effectiveness, and environmental sustainability. To address these challenges, the study introduces BioMedLM, a 2.7 billion parameter model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM is competitive with much larger models on biomedical NLP tasks such as multiple-choice question answering, and it can also be fine-tuned to produce useful answers to patient questions on medical topics.

Model Design and Training

BioMedLM is a GPT-style autoregressive model with a domain-specific tokenizer trained to handle biomedical terminology efficiently. Unlike large general-purpose models, BioMedLM is trained exclusively on PubMed data, aiming for strong performance in biomedical contexts without the computational and financial overhead of larger models. Training ran on 128 40GB Nvidia A100 GPUs. That is a substantial cluster rather than modest hardware, but the resulting 2.7 billion parameter model is small enough to run on far more modest hardware than models with hundreds of billions of parameters.
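
To make the hardware point concrete, the following is a rough back-of-the-envelope sketch of how much memory the weights alone occupy at different numeric precisions. Only the 2.7 billion parameter count comes from the text; the function name and the choice of formats are illustrative assumptions, and the estimate ignores activations, optimizer state, and any inference cache.

```python
# Rough weight-memory estimate for a model of a given size.
# Only the 2.7 billion parameter count comes from the text; the byte widths
# are the standard sizes of each numeric format. Activations, optimizer
# state, and inference caches are deliberately ignored.

BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed to hold the weights, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    n_params = 2.7e9  # BioMedLM parameter count stated in the text
    for fmt, width in BYTES_PER_PARAM.items():
        print(f"{fmt:>9}: {weight_memory_gb(n_params, width):.1f} GB")
```

Under these assumptions the weights take about 10.8 GB at 32-bit precision and about 5.4 GB at 16-bit precision, which is why a model of this size can plausibly be served from a single accelerator, whereas the training run itself used a 128-GPU cluster.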

Evaluation on Biomedical Tasks

BioMedLM was evaluated on a suite of biomedical question-answering tasks: MedMCQA, MedQA, MMLU, PubMedQA, and BioASQ. Notably, the fine-tuned model scored 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam, outperforming or closely rivaling models like GPT-Neo 2.7B and even some larger models on specific tasks. This suggests that a domain-specific training focus can yield models with competitive task performance that are also more accessible and practical for specialized applications.
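
The percentages above are accuracies on multiple-choice questions. As a minimal sketch of how such a score is computed, assume the model assigns one numeric score to each answer option and the highest-scoring option is taken as its prediction. This is not the authors' evaluation code; the function names and the example numbers below are made up for illustration.

```python
from typing import Sequence


def pick_answer(option_scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring answer option."""
    return max(range(len(option_scores)), key=lambda i: option_scores[i])


def accuracy(all_scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of questions whose top-scoring option matches the gold index."""
    if len(all_scores) != len(gold):
        raise ValueError("scores and gold labels must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(pick_answer(s) == g for s, g in zip(all_scores, gold))
    return correct / len(gold)


# Hypothetical scores for three four-option questions (made-up numbers).
example_scores = [
    [0.1, 2.3, -0.4, 0.0],
    [1.5, 0.2, 0.3, 0.1],
    [-1.0, -0.5, 0.4, 0.9],
]
example_gold = [1, 0, 2]  # made-up gold answer indices

print(f"accuracy = {accuracy(example_scores, example_gold):.3f}")
```

In this toy data the first two predictions match the gold answers and the third does not, so the accuracy is two out of three. A reported figure such as 57.3% is the same ratio computed over a full benchmark split.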

Practical Implications and Future Directions

The study suggests that smaller, domain-focused models can be competitive with larger, generalist models on specific tasks. BioMedLM's approach addresses several critical concerns in deploying NLP technologies in sensitive areas like healthcare:

  • Privacy and Security: Because BioMedLM is trained entirely on publicly available PubMed data and can run on local hardware, it offers a transparent alternative to proprietary models that require sending data over the internet.

  • Cost and Accessibility: The training and inference efficiency of BioMedLM make it a feasible option for organizations with limited budgets, democratizing access to advanced NLP capabilities.

  • Environmental Impact: By demonstrating strong performance with significantly fewer parameters, BioMedLM presents a more environmentally friendly option than training and operating larger models.
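
The privacy point in the list above rests on running the model on local hardware rather than sending text to a remote service. A minimal sketch of that workflow, assuming the Hugging Face `transformers` library and PyTorch are installed, might look like the following. The model identifier is taken from the Hub URL given in the abstract; the function name and default settings are illustrative, not part of the original work.

```python
MODEL_ID = "stanford-crfm/BioMedLM"  # Hub identifier taken from the URL in the text


def generate_locally(prompt: str, max_new_tokens: int = 32) -> str:
    """Run the model on this machine and return the decoded continuation.

    The imports are deferred so this file can be loaded without the heavy
    dependencies. Assumes `transformers` and `torch` are installed; the
    weights are downloaded from the Hub on first use and cached locally.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize the prompt into PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

After the initial download, the prompt text is processed entirely on the local machine. Note that this calls the base model, which only continues text; the question-answering results described above come from fine-tuned versions.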

Looking ahead, this work opens several avenues for future research, including the exploration of training techniques that further optimize performance and efficiency for domain-specific models. Additionally, extending the methodology to other specialized fields could yield similarly effective models across a broader range of disciplines.


BioMedLM exemplifies the potential of medium-sized, domain-focused models to achieve high performance on specialized tasks, challenging the prevailing assumption that larger models always perform better. By balancing efficiency with capability, BioMedLM represents a significant step forward in making advanced NLP technology more accessible, transparent, and sustainable, particularly in critical fields such as biomedicine.


