
Abstract

Large language models (LLMs) have emerged as a transformative force in natural language understanding, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered to facilitate scientific discovery. As a burgeoning area in the AI-for-Science community, scientific LLMs warrant comprehensive exploration, yet a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language" while providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzed in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions alongside the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

The survey focuses on Sci-LLMs for scientific languages in the biological and chemical domains, spanning textual, molecular, protein, and genomic data.

Overview

  • Sci-LLMs are a specialized class of language models designed to aid scientific discovery by interpreting and generating complex scientific languages.

  • These models require vast, multifaceted datasets and adapted architectures, such as modified Transformers, to handle the distinctive structures of scientific data.

  • The paper highlights ongoing challenges such as the scarcity of quality training datasets, especially cross-modal ones, and the difficulty of evaluating Sci-LLMs.

  • Ethical concerns, including data privacy and preventing misuse, are critical in the development and deployment of Sci-LLMs.

  • Future research will focus on enlarging training datasets, improving structural data integration, and developing better evaluation metrics.

Introduction

Scientific LLMs (Sci-LLMs) are a subclass of LLMs specifically designed to facilitate scientific discovery within the AI-for-Science community. These models operate in the realm of "scientific language", a term that refers to the specialized vocabularies and grammatical constructs developed within scientific disciplines, distinct from conventional natural language. This survey presents an in-depth examination of Sci-LLMs, focusing on their roles in the biological and chemical domains.
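
To make the notion of a scientific language concrete, the short snippet below (an illustrative sketch, not material from the survey) contrasts a natural-language sentence with molecular, protein, and genomic "sentences". The SMILES string is aspirin; the protein and DNA fragments are arbitrary placeholders.

```python
# Illustrative strings only (not taken from the survey): each scientific
# "language" is a linear sequence, but with its own vocabulary and grammar.
examples = {
    "natural language": "Aspirin inhibits the cyclooxygenase enzymes.",
    "molecular (SMILES)": "CC(=O)OC1=CC=CC=C1C(=O)O",              # aspirin
    "protein (amino acids)": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # arbitrary fragment
    "genomic (DNA)": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",       # arbitrary fragment
}

for name, sentence in examples.items():
    # SMILES uses ring-closure digits and bond symbols, proteins a 20-letter
    # alphabet, and DNA only A/C/G/T -- all far from natural-language grammar.
    print(f"{name:22s} | {sentence}")
```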

Data and Model Architecture

A core aspect of Sci-LLM development involves constructing comprehensive datasets for training and fine-tuning these models. Such datasets span textual, molecular, protein, and genomic languages, often surpassing the scope and complexity of standard linguistic systems. Sci-LLMs require robust architectures that can accommodate the idiosyncrasies of scientific data—lengthy sequences in molecular languages, intricate 3D structures in proteins, or the multi-modal nature encompassing text and other scientific entities. To address these challenges, researchers have devised variations on the Transformer architecture, integrating novel attention mechanisms and pre-training strategies.
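
As a rough illustration of the pipeline such architectures sit on top of, the sketch below tokenizes a SMILES string with a regex of the kind commonly used in molecular-transformer work and passes the token IDs through a vanilla PyTorch Transformer encoder. This is a minimal, hypothetical example assuming PyTorch is available; real Sci-LLMs use much larger vocabularies, dedicated pre-training objectives, and the architectural modifications discussed in the survey.

```python
import re
import torch
import torch.nn as nn

# Regex commonly used to split SMILES into chemically meaningful tokens
# (bracketed atoms, two-letter elements, bonds, ring closures, etc.).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)

tokens = tokenize_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = torch.tensor([[vocab[t] for t in tokens]])       # shape: (batch=1, seq_len)

# Toy encoder: embedding + 2-layer Transformer; real models add positional
# encodings, padding masks, and task-specific heads on top.
embed = nn.Embedding(len(vocab), 64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
hidden = encoder(embed(ids))                            # (1, seq_len, 64)
print(tokens)
print(hidden.shape)
```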

Training and Evaluation Challenges

The survey notes that despite recent advancements, there are persistent challenges concerning the scale and quality of training datasets. Cross-modal datasets, essential for enabling multi-faceted interactions among different types of scientific data, are particularly scarce and require rigorous semantic alignment. Moreover, evaluating Sci-LLMs poses its own set of complexities, especially for generative tasks where the gold standard remains wet-lab experiments. To circumvent the need for exhaustive experimental validation, developing computational benchmarks and metrics that can reliably predict real-world outcomes is indispensable.
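
As one example of the kind of computational proxy the survey points to, the sketch below (assuming RDKit is installed; this is not the survey's own code) scores a batch of generated SMILES strings for validity and uniqueness, two of the simple metrics that benchmarks such as GuacaMol and MOSES build on.

```python
from rdkit import Chem

def validity_and_uniqueness(generated_smiles: list[str]) -> tuple[float, float]:
    canonical = []
    for s in generated_smiles:
        mol = Chem.MolFromSmiles(s)                    # returns None for invalid SMILES
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))    # canonical form for deduplication
    validity = len(canonical) / len(generated_smiles) if generated_smiles else 0.0
    uniqueness = len(set(canonical)) / len(canonical) if canonical else 0.0
    return validity, uniqueness

# Toy batch: two valid molecules (one duplicated) and one invalid string.
print(validity_and_uniqueness(["CCO", "CCO", "C1CC1", "not_a_molecule"]))
```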

Ethical Considerations

Ethical considerations stand at the forefront, given Sci-LLMs' potential impact on sensitive areas like genomics. Data privacy, consent, bias mitigation, misuse prevention, and equitable access to technological benefits are paramount. Integrating ethical principles within Sci-LLMs is as much a technical challenge as it is a moral imperative.

Future Directions

Looking ahead, the survey suggests seven key research directions to hone the capabilities of Sci-LLMs. Among these, expanding the scale of pre-training datasets and incorporating 3D structural data are top priorities. Equally important is refining the evaluation metrics for models, which will be central to validating the generated scientific entities.

Conclusion

In conclusion, the survey lays out both the triumphs and tribulations of Sci-LLMs in navigating the complex landscape of scientific languages. By capturing the essence of the biological and chemical domains within a computational framework, Sci-LLMs not only accelerate scientific discovery but also pave the way toward more general artificial intelligence.

