Efficiently Predicting Mutational Effect on Homologous Proteins by Evolution Encoding (2402.13418v2)

Published 20 Feb 2024 in cs.LG and q-bio.BM

Abstract: Predicting protein properties is paramount for biological and medical advancements. Current protein engineering introduces mutations into a reference protein, called the wild-type, to construct a family of homologous proteins and study their properties. Yet, existing methods easily neglect subtle mutations and fail to capture their effect on protein properties. To this end, we propose EvolMPNN, an Evolution-aware Message Passing Neural Network, an efficient model that learns evolution-aware protein embeddings. EvolMPNN samples sets of anchor proteins, computes evolutionary information from their residues, and employs a differentiable evolution-aware aggregation scheme over these sampled anchors. In this way, EvolMPNN can efficiently use a novel message-passing method to capture the mutation effect on proteins with respect to the anchor proteins. The aggregated evolution-aware embeddings are then combined with sequence embeddings to generate the final comprehensive protein embeddings. Our model performs up to 6.4% better than state-of-the-art methods and attains a 36X inference speedup compared with large pre-trained models. Code and models are available at https://github.com/zhiqiangzhongddu/EvolMPNN.
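
The abstract describes a concrete pipeline: sample anchor proteins, compute residue-based evolutionary differences against those anchors, aggregate them with a differentiable scheme, and fuse the result with sequence embeddings. The following is a minimal PyTorch sketch of that idea only; every name here (EvolAwareAggregator, num_anchors, the uniform anchor sampling, the mean aggregation) is a hypothetical simplification, not the authors' implementation, which lives in the linked repository and may differ substantially.

```python
# Hypothetical sketch of evolution-aware anchor aggregation, loosely
# following the abstract. Not the authors' code; see
# https://github.com/zhiqiangzhongddu/EvolMPNN for the real model.
import torch
import torch.nn as nn


class EvolAwareAggregator(nn.Module):
    def __init__(self, embed_dim: int, num_anchors: int):
        super().__init__()
        self.num_anchors = num_anchors
        # Maps per-anchor evolutionary difference features to messages.
        self.message_mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ELU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Fuses the aggregated evolution-aware embedding with the
        # sequence embedding into the final protein representation.
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, seq_emb: torch.Tensor, residue_emb: torch.Tensor) -> torch.Tensor:
        """seq_emb: (N, D) protein-level sequence embeddings.
        residue_emb: (N, D) pooled residue-level features used here as a
        stand-in for the paper's residue-based evolutionary information."""
        n = seq_emb.size(0)
        # Sample anchor proteins uniformly at random (one simple choice;
        # the paper's sampling scheme may be more structured).
        anchor_idx = torch.randperm(n)[: self.num_anchors]
        anchors = residue_emb[anchor_idx]                        # (K, D)
        # Evolutionary difference of every protein w.r.t. each anchor.
        diffs = residue_emb.unsqueeze(1) - anchors.unsqueeze(0)  # (N, K, D)
        messages = self.message_mlp(diffs)                       # (N, K, D)
        # Differentiable aggregation over anchors (mean, for simplicity).
        evol_emb = messages.mean(dim=1)                          # (N, D)
        # Concatenate with the sequence embedding and project.
        return self.fuse(torch.cat([seq_emb, evol_emb], dim=-1))


# Usage sketch with random data standing in for real embeddings.
model = EvolAwareAggregator(embed_dim=128, num_anchors=16)
seq = torch.randn(32, 128)   # e.g. frozen language-model embeddings
res = torch.randn(32, 128)   # pooled residue features
out = model(seq, res)        # (32, 128) evolution-aware embeddings
```

Because aggregation runs over a fixed number of anchors rather than all protein pairs, the cost per protein stays linear in the anchor count, which is consistent with the efficiency claim in the abstract.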
