
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey (2403.01528v2)

Published 3 Mar 2024 in cs.CL, cs.AI, and q-bio.BM

Abstract: The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry, and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross-modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources are continually updated at https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.


Summary

  • The paper details how multi-modal learning fuses 1D, 2D, and 3D biomolecular representations with natural language for enhanced data analysis.
  • It reviews transformer and multi-stream architectures that capture latent features across modalities using self-supervised and cross-modal training strategies.
  • It highlights practical applications such as predictive modeling and molecule design while addressing challenges such as biomolecule-specific tokenization and the scarcity of large-scale multi-modal datasets.

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Comprehensive Survey

Introduction

The intersection of biomolecular modeling and NLP offers fertile ground for interdisciplinary advances across artificial intelligence, chemistry, and biology. This survey reviews recent progress in cross-modeling of biomolecules and language, abbreviated as BL, a term denoting the combination of biomolecular data with linguistic descriptions. By drawing on multiple data modalities, pairing textual information with molecular and protein representations as sequences, 2D graphs, and 3D structures, BL aims to enrich our understanding of biomolecules from both structural and linguistic perspectives.

Biomolecule Representation

A critical step in BL is the accurate and effective representation of biomolecules. This survey identifies three primary forms of biomolecular data representation (a minimal parsing sketch follows the list):

  • 1D Sequences: Encoding biomolecules as linear strings of monomers or chemical symbols, such as SMILES strings for small molecules and amino-acid (FASTA-format) sequences for proteins.
  • 2D Graphs: Representing molecules as graphs with atoms as nodes and bonds as edges, extended to proteins through constructs such as residue contact maps.
  • 3D Structures: Capturing the spatial conformations of biomolecules, which are vital for understanding their functions and interactions.
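
As a concrete illustration of how the 1D and 2D forms relate, the sketch below parses a SMILES string and enumerates the atoms (nodes) and bonds (edges) of the resulting molecular graph. It assumes the open-source RDKit toolkit is installed; the example molecule and the list-based output are illustrative choices rather than anything prescribed by the survey.

```python
# A minimal sketch, assuming RDKit is installed (pip install rdkit).
# Parses a SMILES string (1D representation) and extracts the 2D molecular
# graph as atom nodes and bond edges.
from rdkit import Chem

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, used only as an example
mol = Chem.MolFromSmiles(smiles)     # returns None if the SMILES is invalid

# Nodes: one entry per atom, labeled by its element symbol.
atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]

# Edges: (begin_atom_index, end_atom_index, bond_type) triples.
bonds = [
    (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
    for b in mol.GetBonds()
]

print(f"{len(atoms)} atoms, {len(bonds)} bonds")
print(atoms[:5], bonds[:3])
```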

Integration Rationales and Objectives

Cross-modeling seeks to exploit the intertwined nature of textual and biomolecular data to reach a multidimensional understanding and to enable applications beyond the scope of either domain alone. Its objectives range from representation learning, where self-supervised training yields embeddings that capture the essence of both data modalities, to instruction following and the development of agent/assistant models that interact with users to provide contextual information or fulfill specific queries.
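
To make the instruction-following objective concrete, the snippet below shows what a single molecule-grounded instruction record might look like during supervised fine-tuning. The field names and example text are hypothetical and are not drawn from any particular dataset discussed in the survey.

```python
# A hypothetical instruction-tuning record that pairs a molecule with a
# natural-language task; field names and wording are illustrative only.
record = {
    "instruction": "Describe the aqueous solubility of the following molecule.",
    "input": "CC(=O)OC1=CC=CC=C1C(=O)O",  # molecule given as a SMILES string
    "output": "The molecule is expected to be only sparingly soluble in water ...",
}

# During fine-tuning, the instruction and input are concatenated into a prompt
# and the model is trained to generate the target output text.
prompt = f"{record['instruction']}\nMolecule: {record['input']}\nAnswer:"
print(prompt)
```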

Learning Frameworks

Progress in BL rests on the exploration of various neural network architectures. Transformer models, in encoder-only, decoder-only, and encoder-decoder configurations, are central to this work, while dual- and multi-stream models leverage the strengths of modality-specific encoders. A noteworthy extension is the PaLM-E-style architecture, which couples a pre-trained LLM with external modality-specific encoders, integrating large-scale language pretraining with biomolecule-specific features.
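
The sketch below illustrates the general idea behind such PaLM-E-style designs under simplifying assumptions: a small trainable projector maps pooled features from a frozen molecule encoder into the token-embedding space of a language model, where they are prepended to the embedded text tokens. All dimensions and module names are placeholders, not the configuration of any specific model surveyed here.

```python
# Minimal PyTorch sketch of a PaLM-E-style bridge: a trainable projector maps
# frozen molecule-encoder features into an LLM's token-embedding space.
# Dimensions are illustrative placeholders.
import torch
import torch.nn as nn

MOL_DIM, LLM_DIM, N_MOL_TOKENS = 300, 4096, 8

class MoleculeProjector(nn.Module):
    def __init__(self, mol_dim: int, llm_dim: int, n_tokens: int):
        super().__init__()
        # Project one pooled molecule embedding into n_tokens "soft tokens".
        self.proj = nn.Linear(mol_dim, llm_dim * n_tokens)
        self.n_tokens, self.llm_dim = n_tokens, llm_dim

    def forward(self, mol_emb: torch.Tensor) -> torch.Tensor:
        # mol_emb: (batch, mol_dim) -> (batch, n_tokens, llm_dim)
        return self.proj(mol_emb).view(-1, self.n_tokens, self.llm_dim)

projector = MoleculeProjector(MOL_DIM, LLM_DIM, N_MOL_TOKENS)
mol_emb = torch.randn(2, MOL_DIM)       # stand-in for frozen encoder output
text_emb = torch.randn(2, 16, LLM_DIM)  # stand-in for embedded text tokens

# Prepend the projected molecule tokens to the text-token embeddings before
# passing the sequence to the (frozen or parameter-efficiently tuned) LLM.
llm_input = torch.cat([projector(mol_emb), text_emb], dim=1)
print(llm_input.shape)  # torch.Size([2, 24, 4096])
```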

Representation Learning Methodologies

Representation learning methodologies are fundamental to advancing BL models: they enable pre-training on expansive unlabeled datasets to capture latent features across modalities. Objectives include masked language modeling (MLM) and next-token prediction (NTP) for text, alongside specialized objectives such as cross-modal alignment (CMA) and self-contrastive learning (SCL) for biomolecular data. Training strategies range from multi-stage training to leveraging pre-trained LLMs for domain adaptation, reflecting the complexity and potential of this approach.
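
As one way to ground the cross-modal alignment (CMA) objective, the sketch below computes a symmetric InfoNCE-style contrastive loss between paired molecule and text embeddings, in the spirit of CLIP-like training. The embedding dimension, temperature, and random inputs are assumptions for illustration only.

```python
# Sketch of a symmetric contrastive (InfoNCE-style) loss for cross-modal
# alignment between paired molecule and text embeddings.
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(mol_emb, txt_emb, temperature=0.07):
    # Normalize so that dot products equal cosine similarities.
    mol = F.normalize(mol_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = mol @ txt.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(mol.size(0))     # i-th molecule pairs with i-th text
    # Average the molecule-to-text and text-to-molecule directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

batch, dim = 4, 256  # illustrative sizes
loss = cross_modal_alignment_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```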

Practical Applications

From predictive modeling of biomolecule properties and interactions to the generative design of molecules and proteins based on textual descriptions, the application spectrum of BL models is expansive. These models facilitate tasks such as molecule-to-text retrieval, multi-modal optimization of biomolecules, and transformation between different molecular representations, showcasing their versatility and utility across both scientific research and practical applications.
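
To give a flavour of how the shared embedding space is used at inference time, the sketch below ranks a library of candidate molecule embeddings against a text-query embedding by cosine similarity, as in text-to-molecule retrieval. The random embeddings are stand-ins for the outputs of trained encoders.

```python
# Minimal retrieval sketch: rank candidate molecule embeddings against a
# text-query embedding by cosine similarity (random embeddings as stand-ins).
import torch
import torch.nn.functional as F

num_candidates, dim = 1000, 256
molecule_library = F.normalize(torch.randn(num_candidates, dim), dim=-1)
query_text_emb = F.normalize(torch.randn(dim), dim=-1)

scores = molecule_library @ query_text_emb  # cosine similarities
top_scores, top_idx = scores.topk(5)        # best-matching molecules
print(top_idx.tolist())
print(top_scores.tolist())
```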

Challenges and Future Directions

While significant strides have been made, the field of BL faces challenges such as specialized tokenization for biomolecules, scarcity of large-scale multimodal datasets, task generalization beyond data generalization, adaptability of LLMs to biological domains, and ethical issues surrounding AI-powered biotechnology. Addressing these challenges paves the way for future developments, underscoring the need for improved methodologies, broader cross-disciplinary collaboration, and ethical frameworks that guide the responsible use of AI in biology and chemistry.
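
As one commonly discussed mitigation for the tokenization issue, the sketch below extends a general-purpose subword tokenizer with molecule-level tokens and resizes the model's embedding table to match. It assumes the Hugging Face transformers library; the token list is hand-picked purely for illustration, whereas a real system would derive it from a chemical corpus.

```python
# Sketch: extend a general-purpose tokenizer with biomolecule tokens so that
# SMILES fragments are not shattered into meaningless subwords.
# Assumes the Hugging Face `transformers` library; token list is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hand-picked example tokens; a real vocabulary would be mined from data.
new_tokens = ["[C@@H]", "[C@H]", "[nH]", "Br", "Cl", "c1ccccc1"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the newly added vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```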

Conclusion

In conclusion, leveraging biomolecules and natural language through multi-modal learning is an instrumental step toward unifying AI, chemistry, and biology. By pursuing the opportunities and addressing the challenges detailed in this survey, the scientific community is poised to unlock deeper insights into biomolecular phenomena and to usher in a new era of discovery and innovation.