BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning (2402.17810v2)

Published 27 Feb 2024 in q-bio.QM, cs.AI, cs.CE, cs.LG, and q-bio.BM

Abstract: Recent research in computational biology has increasingly focused on jointly modeling text and bio-entities, especially molecules and proteins. However, previous efforts such as BioT5 struggled to generalize across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC names). This paper introduces BioT5+, an extension of the BioT5 framework tailored to biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources such as bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, provide a more holistic understanding of biological entities, and substantially improve grounded reasoning over bio-text and bio-sequences. The model is pre-trained and fine-tuned across a large suite of experiments spanning 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 benchmark datasets in total, achieving state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.
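The abstract describes the numerical tokenization only at a high level. A common scheme in this line of work is to split every number into single-digit tokens so a subword vocabulary cannot merge digits inconsistently across examples. The Python sketch below illustrates that idea under this assumption; the function name and spacing rules are hypothetical, not the paper's actual preprocessing.

```python
import re

def split_digits(text: str) -> str:
    """Space-separate every digit so a subword tokenizer (e.g. SentencePiece)
    sees each digit as its own token. A hypothetical sketch of digit-level
    numerical tokenization; not the exact BioT5+ implementation."""
    spaced = re.sub(r"(\d)", r" \1 ", text)      # isolate each digit
    return re.sub(r"\s+", " ", spaced).strip()   # collapse extra whitespace

print(split_digits("Predicted logP: 2.31"))
# -> "Predicted logP: 2 . 3 1"
```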

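The multi-task instruction tuning implies that classification, regression, and generation tasks are all cast into a shared instruction/input/output text format, which is what lets a single sequence-to-sequence model cover all of them. As an illustration only, here is a hypothetical record in that style; the field names, wording, and label are assumptions, not the paper's actual schema.

```python
# A hypothetical instruction-tuning record; field names and the label are
# illustrative assumptions, not the schema or data used by BioT5+.
record = {
    "instruction": "Classify whether the following molecule can cross "
                   "the blood-brain barrier. Answer Yes or No.",
    "input": "CC(=O)Oc1ccccc1C(=O)O",  # aspirin as SMILES; BioT5+ also
                                       # integrates IUPAC names as input
    "output": "Yes",                   # illustrative label only
}

# One flat text prompt per task keeps every task in the same format.
prompt = f"{record['instruction']}\n{record['input']}"
```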