ProtFAD: Introducing function-aware domains as implicit modality towards protein function prediction (2405.15158v2)

Published 24 May 2024 in q-bio.BM and cs.LG

Abstract: Protein function prediction is currently achieved by encoding a protein's sequence or structure, but the gap between sequence and function and the scarcity of high-quality structural data create clear performance bottlenecks. Protein domains are the functionally independent "building blocks" of proteins, and their combinations determine a protein's diverse biological functions. However, most existing studies have yet to thoroughly exploit the rich functional information contained in protein domains. To fill this gap, we propose a synergistic integration approach for a function-aware domain representation, together with a domain-joint contrastive learning strategy that distinguishes different protein functions while aligning the modalities. Specifically, we align domain semantics with GO terms and text descriptions to pre-train domain embeddings. Furthermore, we partition proteins into multiple sub-views based on continuous joint domains for contrastive training under the supervision of a novel triplet InfoNCE loss. Our approach significantly and comprehensively outperforms state-of-the-art methods on various benchmarks and differentiates proteins with distinct functions more clearly than competing methods. Our implementation is available at https://github.com/AI-HPC-Research-Team/ProtFAD.
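The triplet InfoNCE loss mentioned in the abstract can be illustrated with a minimal sketch. The snippet below assumes the standard InfoNCE form over cosine similarities: an anchor embedding is pulled toward a positive view (e.g., a sub-view of a protein sharing its function) and pushed away from negative views. The function names, vector shapes, and temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors (plain Python lists).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def infonce_loss(anchor, positive, negatives, temperature=0.07):
    # Standard InfoNCE: numerator scores the function-sharing positive view,
    # the denominator adds views of functionally distinct proteins.
    pos = math.exp(cosine_sim(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_sim(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

As expected for a contrastive objective, the loss is small when the anchor is close to its positive and far from its negatives, and large in the reverse arrangement.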

