
Learning to design protein-protein interactions with enhanced generalization (2310.18515v3)

Published 27 Oct 2023 in cs.LG

Abstract: Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We fine-tune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase.

Authors (11)
  1. Anton Bushuiev
  2. Roman Bushuiev
  3. Petr Kouba
  4. Anatolii Filkin
  5. Marketa Gabrielova
  6. Michal Gabriel
  7. Jiri Sedlar
  8. Tomas Pluskal
  9. Jiri Damborsky
  10. Stanislav Mazurenko
  11. Josef Sivic

Summary

Learning to Design Protein-Protein Interactions with Enhanced Generalization: An Essay

This paper addresses the design of protein-protein interactions (PPIs) with machine learning, proposing a framework aimed at better generalization beyond the training data. Its main contributions are the PPIRef dataset, a new SE(3)-equivariant model named PPIformer, and evidence that PPIformer outperforms existing methods at predicting the effects of mutations on PPIs.

PPIRef: A Comprehensive Dataset

The authors have curated PPIRef, which they describe as the largest non-redundant dataset of 3D protein-protein interactions to date. The dataset is built from the Protein Data Bank (PDB) and addresses two problems in existing datasets: redundancy and bias. Deduplication relies on a new scalable algorithm called iDist, which identifies and clusters structurally similar protein interfaces, making it possible to build a dataset that is comprehensive yet non-redundant. As a result, the model sees a diverse set of PPI structures during training, which is crucial for broad generalization.
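The deduplication idea can be illustrated with a minimal sketch. This is not the iDist algorithm itself, whose details are not given here; it assumes each interface has already been summarized as a fixed-length vector (the hypothetical `interfaces` list below) and that two interfaces count as near-duplicates when their vectors lie within a chosen Euclidean distance `threshold`. Both the vectors and the threshold are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deduplicate(embeddings, threshold):
    """Greedy deduplication: keep an item only if it is farther than
    `threshold` from every representative kept so far.
    Returns the indices of the kept representatives."""
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(euclidean(emb, embeddings[k]) > threshold for k in kept):
            kept.append(idx)
    return kept


# Hypothetical 2D interface embeddings (toy values, not from PPIRef).
interfaces = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.02), (3.0, 0.5)]
representatives = deduplicate(interfaces, threshold=0.1)
print(representatives)
```

With these toy values, the second and fourth vectors fall within the threshold of an earlier representative and are dropped. The greedy pass compares each item against all kept representatives, so a truly scalable method would need an approximate nearest-neighbor index or a similar shortcut.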

PPIformer: A New Model Architecture

The PPIformer model is SE(3)-equivariant: rotating or translating an input complex transforms the model's internal geometric features accordingly and leaves its scalar predictions unchanged. This property suits protein structures, which have no canonical orientation. Built on Equiformer graph attention layers, PPIformer operates on coarse-grained representations of protein complexes, which reduces the risk of overfitting to precise atomic coordinates. The model is pre-trained on PPIRef with structural masked modeling, a self-supervised objective. Pre-training exploits structural patterns intrinsic to protein interactions and improves the model's ability to generalize to unseen data.
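The difference between invariant and equivariant quantities under rigid motions can be checked numerically. The sketch below is not PPIformer code; it uses three hypothetical 3D points, one rotation about the z-axis, and one translation (all arbitrary assumptions) to show that pairwise distances stay the same (invariance) while offsets from the centroid rotate together with the input (equivariance).

```python
import math


def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def translate(p, t):
    """Shift a 3D point by vector t."""
    return tuple(a + b for a, b in zip(p, t))


def pairwise_distances(points):
    """Distances between all unordered pairs of points."""
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]


def offsets_from_centroid(points):
    """Vector from the centroid to each point."""
    n = len(points)
    c = tuple(sum(p[i] for p in points) / n for i in range(3))
    return [tuple(a - b for a, b in zip(p, c)) for p in points]


# Hypothetical coordinates and an arbitrary rigid motion.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 1.0)]
angle, shift = 0.7, (3.0, -1.0, 2.0)
moved = [translate(rotate_z(p, angle), shift) for p in coords]

# Invariance: pairwise distances are unchanged by the rigid motion.
invariant_ok = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(coords), pairwise_distances(moved))
)

# Equivariance: offsets of the moved points equal the rotated original offsets.
expected = [rotate_z(v, angle) for v in offsets_from_centroid(coords)]
equivariant_ok = all(
    math.isclose(a, b, abs_tol=1e-9)
    for u, v in zip(offsets_from_centroid(moved), expected)
    for a, b in zip(u, v)
)
print(invariant_ok, equivariant_ok)
```

An equivariant network guarantees this behavior by construction for every layer, rather than relying on the model to learn it from augmented data.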

Enhanced Generalization in PPI Predictions

The model's generalization is then tested by fine-tuning it to predict changes in binding affinity upon mutation, a central challenge in PPI design. Fine-tuning uses a physics-informed, thermodynamically motivated loss built on the log-odds of amino acid likelihoods obtained from pre-training. This formulation keeps predictions consistent with thermodynamic principles and enforces properties such as antisymmetry: reversing a mutation flips the sign of the predicted effect.
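A log-odds score of this kind can be sketched as follows. The sketch assumes the model outputs a probability for each amino acid at a masked site, and it scores a substitution as log p(wild type) minus log p(mutant), so that a more likely mutant gets a negative (favorable) score. The sign convention, the function names, and the probabilities are illustrative assumptions, not values or code from the paper.

```python
import math


def log_odds_score(site_probs, wild_type, mutant):
    """Log-odds of wild type versus mutant at one masked site.
    Negative values mean the mutant is predicted to be more likely."""
    return math.log(site_probs[wild_type]) - math.log(site_probs[mutant])


def multi_site_score(per_site_probs, mutations):
    """Sum single-site log-odds over several substitutions.
    `mutations` maps a site key to a (wild_type, mutant) pair."""
    return sum(
        log_odds_score(per_site_probs[site], wt, mt)
        for site, (wt, mt) in mutations.items()
    )


# Hypothetical predicted probabilities at one masked interface site.
site_probs = {"A": 0.10, "L": 0.60, "W": 0.30}

forward = log_odds_score(site_probs, "A", "L")  # substitution A -> L
reverse = log_odds_score(site_probs, "L", "A")  # reverse substitution L -> A
print(forward, reverse)
```

Because both directions read from the same predicted distribution, the forward and reverse scores are exact negatives of each other, which is the antisymmetry property mentioned above. A substitution that leaves the residue unchanged scores zero.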

Empirical Results

In empirical evaluations, PPIformer outperforms existing methods at predicting mutation effects, both on new, non-leaking splits of standard labeled datasets and in independent case studies. In the case studies on SARS-CoV-2 antibody optimization and staphylokinase engineering, PPIformer not only outperformed competitors in retrieval rates but also ranked potentially beneficial mutations with high precision.
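Retrieval-style evaluation of a mutation ranking can be sketched with a precision-at-k measure. This is a generic illustration, not the paper's exact evaluation protocol; the mutation names, scores, and the set of beneficial mutations below are toy values, and the sketch assumes that lower predicted scores mean more favorable mutations.

```python
def precision_at_k(scores, beneficial, k):
    """Fraction of the k best-scored mutations that are truly beneficial.
    Lower score is treated as more favorable (an assumption)."""
    ranked = sorted(scores, key=scores.get)  # most negative score first
    top = ranked[:k]
    return sum(1 for name in top if name in beneficial) / k


# Toy predicted scores and a toy set of experimentally beneficial mutations.
toy_scores = {"mutA": -1.2, "mutB": 0.4, "mutC": -0.3, "mutD": 0.9, "mutE": -0.8}
toy_beneficial = {"mutA", "mutC"}

p_at_2 = precision_at_k(toy_scores, toy_beneficial, 2)
p_at_3 = precision_at_k(toy_scores, toy_beneficial, 3)
print(p_at_2, p_at_3)
```

Measures like this reward a model for placing the few beneficial mutations near the top of the list, which matters in practice because only a handful of candidates can be tested experimentally.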

Implications and Future Directions

The contributions of this paper matter for both the theory and the practice of computational biology. The PPIRef dataset offers a foundation for future machine learning models aimed at protein engineering and may open new paths for therapeutic development. The modeling choices behind PPIformer, namely an SE(3)-equivariant architecture and a thermodynamically informed objective function, offer a blueprint for models that must understand and predict molecular interactions in three-dimensional space.

Moving forward, this work opens the way to broader applications of machine learning in structural bioinformatics and to more accurate predictive tools. Further improvements to the model architecture could extend generalization to a wider spectrum of protein interactions. Future research may also apply the same methodological framework to other biomolecular interactions, such as protein-ligand and protein-DNA/RNA interactions, to support drug discovery and development. Overall, the paper marks a substantive advance toward robust AI-assisted bioengineering.
