
GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning (2408.00057v1)

Published 31 Jul 2024 in q-bio.BM and cs.LG

Abstract: Proteins play a vital role in biological processes and are indispensable for living organisms. Accurate representation of proteins is crucial, especially in drug development. Recently, there has been a notable increase in interest in utilizing machine learning and deep learning techniques for unsupervised learning of protein representations. However, these approaches often focus solely on the amino acid sequence of proteins and lack factual knowledge about proteins and their interactions, thus limiting their performance. In this study, we present GOProteinGNN, a novel architecture that enhances protein language models by integrating protein knowledge graph information during the creation of amino-acid-level representations. Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process through graph-based learning. By doing so, we can capture complex relationships and dependencies between proteins and their functional annotations, resulting in more robust and contextually enriched protein representations. Unlike previous fusion methods, GOProteinGNN uniquely learns the entire protein knowledge graph during training, which allows it to capture broader relational nuances and dependencies beyond the mere triplets used in previous work. We perform a comprehensive evaluation on several downstream tasks, demonstrating that GOProteinGNN consistently outperforms previous methods, showcasing its effectiveness and establishing it as a state-of-the-art solution for protein representation learning.
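
The abstract describes a knowledge-fusion architecture: amino-acid embeddings from a protein language model are pooled into a protein node, connected to Gene Ontology (GO) annotation nodes in a knowledge graph, updated with a GNN, and fed back to enrich the residue-level representations. The sketch below is a minimal illustration of that idea only, not the authors' implementation: it assumes PyTorch, a precomputed residue embedding matrix, learnable GO-node embeddings, a plain normalized-adjacency GCN layer, and a simple concatenation-based fusion rule, all of which are hypothetical choices not taken from the paper.

```python
# Hypothetical sketch of the knowledge-fusion idea described in the abstract.
# Module names, dimensions, pooling, and the fusion rule are illustrative
# assumptions, not GOProteinGNN's actual architecture.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One graph convolution: X' = ReLU(A_hat @ X @ W)."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj is assumed to be a normalized adjacency matrix with self-loops.
        return torch.relu(self.linear(adj @ x))


class KnowledgeFusionBlock(nn.Module):
    """Fuses a pooled protein node with GO-term nodes via a GNN, then
    broadcasts the knowledge-enriched protein node back to each residue."""

    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.gnn = nn.ModuleList([SimpleGCNLayer(dim) for _ in range(num_layers)])
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, residue_emb, go_emb, adj):
        # residue_emb: (L, d) amino-acid embeddings from a protein language model
        # go_emb:      (K, d) learnable embeddings of GO / annotation nodes
        # adj:         (K+1, K+1) normalized adjacency over [protein, GO nodes]
        protein_node = residue_emb.mean(dim=0, keepdim=True)   # (1, d) pooled protein
        nodes = torch.cat([protein_node, go_emb], dim=0)        # (K+1, d) graph nodes
        for layer in self.gnn:
            nodes = layer(nodes, adj)                           # message passing over the KG
        enriched = nodes[0:1].expand_as(residue_emb)            # broadcast protein node back
        return self.fuse(torch.cat([residue_emb, enriched], dim=-1))  # (L, d) enriched residues
```

Broadcasting the graph-updated protein node back onto every residue is one simple way to realize the abstract's claim of integrating information at both the amino-acid level and the whole-protein level; the paper's actual fusion mechanism may differ.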

