
Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery (2306.12802v3)

Published 22 Jun 2023 in cs.LG, cs.AI, and q-bio.BM

Abstract: Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced predictions, they are usually based on a limited set of modalities and do not exploit available knowledge about existing relations among molecules and proteins. In this study, we demonstrate that by incorporating knowledge graphs from diverse sources and modalities into the sequence or SMILES representations, we can further enrich the representations and achieve state-of-the-art results for drug-target binding affinity prediction on the established Therapeutic Data Commons (TDC) benchmarks. We release a set of multimodal knowledge graphs, integrating data from seven public data sources and containing over 30 million triples. Our intention is to foster additional research exploring how multimodal knowledge-enhanced protein/molecule embeddings can improve prediction tasks, including prediction of binding affinity. We also release pretrained models learned from our multimodal knowledge graphs, along with source code for running standard benchmark tasks for prediction of binding affinity.
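The core idea in the abstract is to enrich pretrained sequence/SMILES embeddings with knowledge-graph-derived embeddings before feeding a downstream affinity predictor. A minimal sketch of that enrichment step, assuming hypothetical embedding dimensions and randomly generated vectors as stand-ins for actual pretrained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained unimodal embeddings (dimensions are illustrative):
# a protein embedding from a protein language model and a molecule
# embedding from a chemical language model over SMILES.
protein_seq_emb = rng.standard_normal(1024)
molecule_smiles_emb = rng.standard_normal(768)

# Hypothetical knowledge-graph embeddings for the same two entities,
# learned from multimodal KG triples (e.g., via a graph neural network).
protein_kg_emb = rng.standard_normal(256)
molecule_kg_emb = rng.standard_normal(256)

def enriched_pair_features(seq_emb, smiles_emb, prot_kg, mol_kg):
    """Concatenate sequence/SMILES embeddings with KG embeddings to form
    the joint feature vector consumed by a binding-affinity regressor."""
    protein = np.concatenate([seq_emb, prot_kg])
    molecule = np.concatenate([smiles_emb, mol_kg])
    return np.concatenate([protein, molecule])

features = enriched_pair_features(
    protein_seq_emb, molecule_smiles_emb, protein_kg_emb, molecule_kg_emb
)
print(features.shape)  # (2304,)
```

Any regressor (e.g., an MLP trained on TDC drug-target interaction splits) could then consume `features`; the sketch only shows the concatenation-based enrichment, not the paper's specific model.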
