Overview
- BioMedLM is a 2.7 billion parameter model trained on PubMed abstracts and articles for biomedical NLP tasks, achieving competitive performance despite its modest size.
- Designed as a GPT-style model, BioMedLM prioritizes efficiency and specialization by training solely on PubMed data, demonstrating that such models can be trained and run on modest hardware.
- It achieves impressive results on biomedical question-answering benchmarks, outperforming or closely rivaling larger, generalist models on specific tasks.
- BioMedLM's approach addresses key concerns for healthcare NLP applications, including data privacy, cost, and environmental impact.
BioMedLM: A Specialized Language Model for Biomedical NLP Tasks
Introduction
In recent years, language models such as GPT-4 and Med-PaLM 2 have significantly advanced NLP across various domains, including biomedicine. However, their vast size, proprietary nature, and resource-intensive demands pose serious practical limitations, especially for applications requiring data privacy, cost-effectiveness, and environmental sustainability. Addressing these challenges, the paper introduces BioMedLM, a 2.7 billion parameter model trained exclusively on PubMed abstracts and full articles. BioMedLM demonstrates competitive performance on biomedical NLP tasks, such as multiple-choice question answering and generating answers to patient medical questions, against significantly larger counterparts.
Model Design and Training
BioMedLM is architected as a GPT-style autoregressive model with a domain-specific tokenizer trained to handle biomedical terminology efficiently. Unlike large-scale general models, BioMedLM's training relies exclusively on PubMed data, targeting strong performance in biomedical contexts without the computational and financial overhead associated with larger models. Training was executed on 128 40GB Nvidia A100 GPUs, and the resulting 2.7B parameter model is small enough to be fine-tuned and run on far more modest hardware, such as a single GPU.
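As a rough illustration of what running a model of this size locally looks like, the sketch below loads a GPT-style checkpoint and its tokenizer with the Hugging Face transformers library, compares how a domain tokenizer splits a biomedical term against GPT-2's general-purpose tokenizer, and generates a short continuation. The repository id stanford-crfm/BioMedLM is an assumption about where the released weights live, and this is a minimal sketch rather than the paper's actual training or inference setup.

```python
# Minimal sketch (checkpoint id is an assumption): load a GPT-style biomedical
# model with Hugging Face transformers, inspect tokenization, and generate text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

biomed_name = "stanford-crfm/BioMedLM"  # assumed Hugging Face repository id

biomed_tok = AutoTokenizer.from_pretrained(biomed_name)
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# A domain-specific tokenizer should split biomedical terms into far fewer pieces
# than a general-purpose one, which shortens sequences and saves compute.
term = "thrombocytopenia"
print("domain tokenizer:", biomed_tok.tokenize(term))
print("gpt2 tokenizer:  ", gpt2_tok.tokenize(term))

# A 2.7B parameter model fits on a single modern GPU; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(biomed_name).to(device).eval()

prompt = "Metformin is a first-line therapy for type 2 diabetes because"
inputs = biomed_tok(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(biomed_tok.decode(out[0], skip_special_tokens=True))
```

Because everything above runs locally, no query text leaves the machine, which is the deployment property the privacy discussion below relies on.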
Evaluation on Biomedical Tasks
BioMedLM's performance was rigorously evaluated across a suite of biomedical question-answering tasks including MedMCQA, MedQA, MMLU, PubMedQA, and BioASQ. Notably, BioMedLM achieved a score of 57.3% on MedMCQA and 69.0% on the MMLU Medical Genetics exam, outperforming or closely rivaling models like GPT-Neo 2.7B and even some larger models on specific tasks. This reveals that a domain-specific focus during training can yield models with competitive task performance, while also being more accessible and practical for specialized applications.
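For concreteness, the sketch below shows one common way accuracy is computed on multiple-choice benchmarks like MedMCQA: each answer option is scored by the log-likelihood the model assigns to it given the question, and the highest-scoring option is taken as the prediction. This is a generic recipe under stated assumptions; the question and options are illustrative, the checkpoint id is assumed, and the paper's actual evaluation may use fine-tuning or a different prompt format.

```python
# Sketch of likelihood-based multiple-choice scoring for a MedMCQA-style item.
# Generic recipe; the question/options are illustrative and the checkpoint id is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stanford-crfm/BioMedLM"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "Which vitamin deficiency causes scurvy?"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens given the question."""
    prompt_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids
    full_ids = tokenizer(f"Question: {question}\nAnswer: {option}", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Assumes the prompt tokenization is a prefix of the full tokenization,
    # so the answer option occupies the trailing positions.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, full_ids[0, pos + 1]].item() for pos in answer_positions)

scores = [option_logprob(question, opt) for opt in options]
print("Predicted answer:", options[max(range(len(options)), key=lambda i: scores[i])])
```

Averaging such predictions over a benchmark's test set yields accuracy figures like those quoted above; length-normalizing the option scores is a common variant when options differ in token count.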
Practical Implications and Future Directions
The paper underscores the ability of smaller, domain-focused models to meet or exceed the performance of larger, generalist models on specific tasks. BioMedLM's approach addresses several critical concerns in deploying NLP technologies in sensitive areas like healthcare:
- Privacy and Security: Trained entirely on publicly available PubMed data and able to run on local hardware, BioMedLM offers a transparent and secure alternative to proprietary models that require transmitting data over the internet.
- Cost and Accessibility: The training and inference efficiency of BioMedLM make it a feasible option for organizations with limited budgets, democratizing access to advanced NLP capabilities.
- Environmental Impact: By demonstrating strong performance with significantly fewer parameters, BioMedLM presents an environmentally friendlier option compared to training and operating larger models.
Looking ahead, this work opens several avenues for future research, including the exploration of training techniques that further optimize performance and efficiency for domain-specific models. Additionally, extending the methodology to other specialized fields could yield similarly effective models across a broader range of disciplines.
Conclusion
BioMedLM exemplifies the potential of medium-sized, domain-focused models to achieve high performance on specialized tasks, challenging the prevailing assumption that larger models always perform better. By balancing efficiency with capability, BioMedLM represents a significant step forward in making advanced NLP technology more accessible, transparent, and sustainable, particularly in critical fields such as biomedicine.
Authors: Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, Christopher D. Manning