Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship (2405.06663v1)

Published 29 Apr 2024 in q-bio.BM, cs.AI, and cs.LG

Abstract: The goal of protein representation learning is to extract knowledge from protein databases that can be applied to various protein-related downstream tasks. Although protein sequence, structure, and function are the three key modalities for a comprehensive understanding of proteins, existing methods for protein representation learning have utilized only one or two of these modalities due to the difficulty of capturing the asymmetric interrelationships between them. To account for this asymmetry, we introduce our novel asymmetric multi-modal masked autoencoder (AMMA). AMMA adopts (1) a unified multi-modal encoder to integrate all three modalities into a unified representation space and (2) asymmetric decoders to ensure that sequence latent features reflect structural and functional information. The experiments demonstrate that the proposed AMMA is highly effective in learning protein representations that exhibit well-aligned inter-modal relationships, which in turn makes it effective for various downstream protein-related tasks.

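The abstract outlines two architectural ideas: a unified multi-modal encoder that maps sequence, structure, and function into one representation space, and asymmetric decoders that force the sequence latents to carry structural and functional information. The PyTorch sketch below is a rough illustration of that idea, not the authors' implementation: all module sizes, vocabularies, and prediction heads (`AMMASketch`, `seq_head`, `struct_decoder`, `func_head`) are hypothetical choices, and the input masking of a masked autoencoder is omitted for brevity.

```python
# Minimal PyTorch sketch of the AMMA idea as described in the abstract.
# Module sizes, vocabularies, the masking step (omitted here), and the exact
# decoding scheme are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class AMMASketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4,
                 seq_vocab=25, struct_vocab=21, func_vocab=512):
        super().__init__()
        # Per-modality token embeddings (vocabulary sizes are placeholders).
        self.seq_emb = nn.Embedding(seq_vocab, d_model)
        self.struct_emb = nn.Embedding(struct_vocab, d_model)
        self.func_emb = nn.Embedding(func_vocab, d_model)
        # (1) Unified multi-modal encoder: one transformer over the
        # concatenated sequence, structure, and function tokens, so all
        # three modalities share a single representation space.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # (2) Asymmetric decoding: structure and function are predicted from
        # the *sequence* latent features only, pushing those latents to carry
        # structural and functional information.
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.struct_decoder = nn.Sequential(
            nn.TransformerEncoder(dec_layer, num_layers=1),
            nn.Linear(d_model, struct_vocab),
        )
        self.seq_head = nn.Linear(d_model, seq_vocab)    # masked-sequence reconstruction
        self.func_head = nn.Linear(d_model, func_vocab)  # pooled function prediction

    def forward(self, seq_tok, struct_tok, func_tok):
        L = seq_tok.size(1)
        x = torch.cat([self.seq_emb(seq_tok),
                       self.struct_emb(struct_tok),
                       self.func_emb(func_tok)], dim=1)
        z = self.encoder(x)        # unified multi-modal representation
        z_seq = z[:, :L]           # sequence latent features
        seq_logits = self.seq_head(z_seq)                 # reconstruct sequence tokens
        struct_logits = self.struct_decoder(z_seq)        # structure from sequence latents
        func_logits = self.func_head(z_seq.mean(dim=1))   # function from sequence latents
        return seq_logits, struct_logits, func_logits


# Tiny smoke test with random token ids; shapes are illustrative only.
model = AMMASketch()
seq = torch.randint(0, 25, (2, 50))      # amino-acid tokens
struct = torch.randint(0, 21, (2, 50))   # per-residue structure tokens
func = torch.randint(0, 512, (2, 8))     # function annotation tokens
seq_out, struct_out, func_out = model(seq, struct, func)
```

The asymmetry lives in the decoding step: structure and function are reconstructed only from the sequence-position latents, while the structure and function latents are not asked to reconstruct the sequence.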
