BioGPT: Domain-Specific LLM for Biomedical Text Analytics
The paper presents BioGPT, a generative pre-trained Transformer language model tailored for biomedical text generation and mining. The work addresses the domain shift that limits general-purpose models such as GPT-2 when applied directly to biomedical tasks. Unlike prior work, which predominantly focused on BERT-like models for understanding tasks, BioGPT targets generation, filling a critical gap in biomedical NLP.
Model Architecture and Pre-training
BioGPT is built on the Transformer architecture, adopting the GPT-2 medium configuration, and is trained from scratch on a substantial biomedical corpus of 15 million PubMed abstracts. Rather than reusing GPT-2's general-domain vocabulary, the authors learn a byte-pair-encoding vocabulary on this corpus, emphasizing that in-domain data and an in-domain vocabulary are both essential for capturing the biomedical terminology and context needed for accurate text generation.
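As a quick illustration, the sketch below loads the publicly released checkpoint through Hugging Face transformers and generates a continuation. The checkpoint name microsoft/biogpt and the decoding settings are assumptions of this example rather than details from the paper.

```python
# Minimal sketch: loading BioGPT via Hugging Face transformers. The
# checkpoint name "microsoft/biogpt" and the decoding settings are
# assumptions of this example, not details stated in the paper.
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

# GPT-2 medium dimensions (24 layers, 1024-dim hidden states), with a
# BPE vocabulary learned on the PubMed corpus rather than web text.
print(model.config.num_hidden_layers, model.config.hidden_size)

inputs = tokenizer("Aspirin is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40, num_beams=5)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```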
Evaluation and Results
BioGPT was evaluated across six biomedical NLP tasks, outperforming existing models on most of them. Noteworthy results include:
- Relation Extraction: Achieved F1 scores of 44.98%, 38.42%, and 40.76% on the BC5CDR, KD-DTI, and DDI end-to-end relation extraction benchmarks, respectively, surpassing both pipeline-based methods and prior sequence-to-sequence approaches such as REBEL (a target-serialization sketch follows this list).
- Question Answering: Set a new benchmark with 78.2% accuracy on the PubMedQA dataset, highlighting its potential in biomedical QA systems.
- Document Classification: Scored 85.12% F1 on the HoC dataset, improving over previous models like BioBERT and PubMedBERT.
- Text Generation: Demonstrated significant improvements in generating coherent and contextually relevant biomedical text compared to GPT-2.
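To make the generative formulation of relation extraction concrete, the sketch below serializes a triple in the two target styles the paper compares: a structured sequence with special tokens and a plain natural-language sentence. The templates are illustrative assumptions, not the paper's verbatim formats.

```python
# Hypothetical helpers contrasting the two target formats compared in the
# paper; the templates are illustrative, not the paper's exact prompts.
def structured_target(head: str, relation: str, tail: str) -> str:
    # Structured sequence with special tokens (in the spirit of REBEL).
    return f"<triplet> {head} <rel> {relation} <tail> {tail}"

def natural_language_target(head: str, relation: str, tail: str) -> str:
    # Natural-language sequence, which aligns better with BioGPT's pre-training.
    return f"the relation between {head} and {tail} is {relation}"

print(structured_target("aspirin", "inhibitor", "COX-1"))
print(natural_language_target("aspirin", "inhibitor", "COX-1"))
```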
Methodological Insights
The paper also examines how prompt design and target-sequence formatting affect performance. Experiments indicate that expressing targets in natural language yields better results than structured sequences built from special tokens, underscoring the importance of aligning output formats with what the model saw during pre-training. Additionally, using continuous embeddings as soft prompts proved advantageous, with the optimal prompt length varying by task.
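A minimal sketch of the soft-prompt idea, under stated assumptions: a short matrix of trainable embeddings is prepended to the token embeddings before the forward pass. The prompt length of 9 and the frozen backbone are illustrative simplifications of this sketch; the paper tunes the prompt length per task and fine-tunes the model for each downstream task.

```python
# Sketch of continuous ("soft") prompting: trainable embeddings are
# prepended to the token embeddings. The prompt length and the frozen
# backbone are illustrative choices, not the paper's exact setup.
import math

import torch
import torch.nn as nn
from transformers import BioGptForCausalLM, BioGptTokenizer

class SoftPrompt(nn.Module):
    def __init__(self, length: int, dim: int):
        super().__init__()
        # Small random init; these vectors are the learned "virtual tokens".
        self.embeddings = nn.Parameter(torch.randn(length, dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, seq, dim) -> (batch, prompt_length + seq, dim)
        prompt = self.embeddings.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
model.requires_grad_(False)  # simplification: train only the prompt

soft_prompt = SoftPrompt(length=9, dim=model.config.hidden_size)

inputs = tokenizer("the relation between aspirin and COX-1 is", return_tensors="pt")
# BioGPT scales token embeddings by sqrt(hidden_size) when given input_ids;
# replicate that scaling here, since passing inputs_embeds bypasses it.
scale = math.sqrt(model.config.hidden_size)
token_embeds = model.get_input_embeddings()(inputs["input_ids"]) * scale
outputs = model(inputs_embeds=soft_prompt(token_embeds))
print(outputs.logits.shape)  # (1, 9 + seq_len, vocab_size)
```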
Implications and Future Directions
BioGPT's success on multiple fronts suggests that domain-specific LLMs can effectively bridge the gap where general models falter due to domain shift. This advancement opens new avenues for integrating generative models in biomedical contexts, enhancing tasks like automated literature review, clinical decision support, and drug interaction prediction.
Looking ahead, scaling BioGPT to larger model sizes and broader biomedical corpora could further enhance its capabilities. There is also potential for applying it to more diverse biomedical NLP tasks, such as clinical narrative mining, fostering comprehensive AI-driven healthcare solutions.
The paper convincingly illustrates the merit of pre-training language models on domain-specific corpora and of thoughtfully designing task adaptations for downstream performance. BioGPT sets a precedent for future research on specialized generative models in complex, nuanced domains like biomedicine.