
BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine (2308.09442v2)

Published 18 Aug 2023 in cs.CE

Abstract: Foundation models (FMs) have exhibited remarkable performance across a wide range of downstream tasks in many domains. Nevertheless, general-purpose FMs often face challenges when confronted with domain-specific problems, due to their limited access to the proprietary training data in a particular domain. In biomedicine, there are various biological modalities, such as molecules, proteins, and cells, which are encoded by the language of life and exhibit significant modality gaps with human natural language. In this paper, we introduce BioMedGPT, an open multimodal generative pre-trained transformer (GPT) for biomedicine, to bridge the gap between the language of life and human natural language. BioMedGPT allows users to easily ``communicate'' with diverse biological modalities through free text, which is the first of its kind. BioMedGPT aligns different biological modalities with natural language via a large generative LLM, namely, BioMedGPT-LM. We publish BioMedGPT-10B, which unifies the feature spaces of molecules, proteins, and natural language via encoding and alignment. Through fine-tuning, BioMedGPT-10B outperforms or is on par with human and significantly larger general-purpose foundation models on the biomedical QA task. It also demonstrates promising performance in the molecule QA and protein QA tasks, which could greatly accelerate the discovery of new drugs and therapeutic targets. In addition, BioMedGPT-LM-7B is the first large generative LLM based on Llama2 in the biomedical domain, therefore is commercial friendly. Both BioMedGPT-10B and BioMedGPT-LM-7B are open-sourced to the research community. In addition, we publish the datasets that are meticulously curated for the alignment of multi-modalities, i.e., PubChemQA and UniProtQA. All the models, codes, and datasets are available at \url{https://github.com/PharMolix/OpenBioMed}.

BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine

The paper introduces BioMedGPT, a generative pre-trained transformer designed for the biomedical domain. BioMedGPT addresses the challenge faced by general-purpose foundation models when dealing with domain-specific problems due to limited access to proprietary training data in specific areas such as biomedicine. The complex biomedical domain encompasses numerous modalities like molecules, proteins, and cells, which are not easily accessible to typical LLMs.

Key Contributions

  1. BioMedGPT Framework: The framework bridges the "language of life," represented by biological modalities, and human natural language. A large generative language model, BioMedGPT-LM, underpins this alignment by encoding molecules, proteins, and text into a unified feature space.
  2. BioMedGPT-10B: This model specializes in biomedical question-answering tasks, allowing for superior comprehension and reasoning over various biological data. Significantly, BioMedGPT-10B outperforms larger general-purpose models in specific tasks, demonstrating its strength in targeted applications.
  3. BioMedGPT-LM-7B: Built on Llama2, this is the first large generative language model for biomedicine based on Llama2, and is therefore commercially friendly, enhancing the domain's accessibility.
  4. Curated Datasets: The publication of datasets like PubChemQA and UniProtQA facilitates the alignment of multiple modalities, supporting broader research efforts.
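
The alignment idea in contribution 1 — separate encoders per biological modality whose outputs are projected into the language model's shared feature space — can be sketched in plain Python. All names below (`ModalityEncoder`, `to_shared_space`, the toy hash-based encoding) are illustrative assumptions, not the authors' actual architecture, which uses learned neural encoders:

```python
import random

random.seed(0)
EMBED_DIM = 8  # toy dimension; the real model uses far larger embeddings


def random_matrix(rows, cols):
    """Toy stand-in for a learned projection matrix."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(vec, matrix):
    """Map a modality-specific embedding into the shared text space."""
    return [sum(v * m for v, m in zip(vec, col)) for col in zip(*matrix)]


class ModalityEncoder:
    """Hypothetical encoder: folds the characters of a SMILES string or
    protein sequence into a fixed-size vector (stand-in for a molecular
    graph encoder or protein language model)."""

    def __init__(self, dim):
        self.dim = dim

    def encode(self, sequence):
        vec = [0.0] * self.dim
        for i, ch in enumerate(sequence):
            vec[(ord(ch) + i) % self.dim] += 1.0
        return vec


# One encoder and one projection per modality; both land in the same
# unified feature space consumed by the language model.
mol_encoder, prot_encoder = ModalityEncoder(EMBED_DIM), ModalityEncoder(EMBED_DIM)
mol_proj, prot_proj = random_matrix(EMBED_DIM, EMBED_DIM), random_matrix(EMBED_DIM, EMBED_DIM)


def to_shared_space(sequence, encoder, proj):
    return project(encoder.encode(sequence), proj)


mol_vec = to_shared_space("CCO", mol_encoder, mol_proj)          # ethanol SMILES
prot_vec = to_shared_space("MKTAYIAK", prot_encoder, prot_proj)  # toy sequence
```

The key design point the sketch captures is that after projection, molecule and protein embeddings live in the same vector space as the language model's text tokens, which is what lets free-text questions condition on structural inputs.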

Experimental Results

The paper shows BioMedGPT's capability in three key tasks:

  • Biomedical QA: BioMedGPT-10B achieves strong results in biomedical QA tasks, matching or surpassing human-level accuracy on benchmarks like PubMedQA and MedMCQA, which underscores its domain-specific strength.
  • Molecule QA: Given a molecular structure, the model answers free-text questions about it, demonstrating its ability to translate between molecular representations and human language.
  • Protein QA: BioMedGPT-10B provides robust text-based answers regarding protein functions, highlighting the model's ability to bridge the gap between structured biological information and natural language.
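
One way to picture the molecule QA and protein QA interfaces is as prompt assembly: a placeholder marks where the aligned modality embedding is spliced into the model's input, and the user's question follows as free text. The template below is a hypothetical illustration, not BioMedGPT's actual prompt format:

```python
def build_qa_prompt(modality, structure, question):
    """Assemble a hypothetical multimodal QA prompt. The <MOL>/<PROT>
    tags mark where the projected modality embedding would be spliced
    into the language model's input sequence."""
    tags = {"molecule": "<MOL>", "protein": "<PROT>"}
    if modality not in tags:
        raise ValueError(f"unsupported modality: {modality}")
    tag = tags[modality]
    return f"{tag} {structure} {tag}\nQuestion: {question}\nAnswer:"


# Example prompts for the two QA tasks described above.
mol_prompt = build_qa_prompt("molecule", "CCO", "What is the molecular weight?")
prot_prompt = build_qa_prompt(
    "protein", "MKTAYIAKQR", "What is the function of this protein?"
)
```

Under this framing, fine-tuning on PubChemQA and UniProtQA teaches the model to map (structure, question) pairs to free-text answers, which is what the two datasets published with the paper are curated for.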

Implications and Future Directions

BioMedGPT's contributions advance the ability of AI systems to understand and process the language of life, with significant implications for drug discovery and the understanding of complex biological systems. The model's open-source nature encourages further exploration in multimodal learning within the biomedical field.

Future developments may expand BioMedGPT's capabilities to include more biological modalities and improve interpretability and safety in AI models. Innovations in evaluation metrics and enhanced model transparency are critical for achieving these advancements.

Through these efforts, BioMedGPT sets a foundation for the integration of AI in scientific research, fostering a deeper collaboration between human and artificial intelligence within the biomedical sector. It not only provides a tool for current applications but also opens pathways for future breakthroughs in biomedical research.

Authors (7)
  1. Yizhen Luo (10 papers)
  2. Jiahuan Zhang (6 papers)
  3. Siqi Fan (31 papers)
  4. Kai Yang (187 papers)
  5. Yushuai Wu (6 papers)
  6. Mu Qiao (28 papers)
  7. Zaiqing Nie (27 papers)
Citations (59)