
MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding (2402.14391v1)

Published 22 Feb 2024 in cs.LG and q-bio.BM

Abstract: Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities. The growing demand and cost of experimental PPI assays require computational methods for efficient PPI prediction. While existing methods rely heavily on protein sequence for PPI prediction, it is the protein structure that is key to determining the interactions. To take both protein modalities into account, we define the microenvironment of an amino acid residue by its sequence and structural contexts, which describe the surrounding chemical properties and geometric features. In addition, microenvironments defined in previous work are largely based on experimentally assayed physicochemical properties, for which the "vocabulary" is usually extremely small. This makes it difficult to cover the diversity and complexity of microenvironments. In this paper, we propose Microenvironment-Aware Protein Embedding for PPI prediction (MAPE-PPI), which encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment "vocabulary" (i.e., codebook). Moreover, we propose a novel pre-training strategy, namely Masked Codebook Modeling (MCM), to capture the dependencies between different microenvironments by randomly masking the codebook and reconstructing the input. The learned microenvironment codebook can then be reused as an off-the-shelf tool to efficiently and effectively encode proteins of different sizes and functions for large-scale PPI prediction. Extensive experiments show that MAPE-PPI can scale to datasets with millions of PPIs, with a better trade-off between effectiveness and computational efficiency than state-of-the-art competitors.
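
To make the codebook idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each residue's microenvironment has already been encoded as a small vector, that a code is assigned by nearest-neighbor lookup under squared Euclidean distance, and that masking is applied to the per-residue code assignments. The abstract itself only says that the codebook is randomly masked and the input reconstructed, so the masking scheme here is an assumption. The names `quantize`, `mask_codes`, and `MASK`, as well as the toy vectors, are hypothetical.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked code assignment


def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector."""
    def sqdist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(e, codebook[k]))
        for e in embeddings
    ]


def mask_codes(codes, mask_ratio, rng):
    """Replace a random subset of code indices with MASK.

    Returns the masked list and the sorted positions that were masked,
    which a reconstruction objective would try to recover.
    """
    n_mask = int(round(mask_ratio * len(codes)))
    positions = sorted(rng.sample(range(len(codes)), n_mask))
    masked = list(codes)
    for p in positions:
        masked[p] = MASK
    return masked, positions


# Toy data (assumed): a 4-entry codebook and 4 residue embeddings in 2-D.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
residues = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9]]

codes = quantize(residues, codebook)
masked, positions = mask_codes(codes, 0.5, random.Random(0))
print(codes, masked, positions)
```

In this sketch a protein of any length is reduced to a sequence of integer codes, which is what would make a learned codebook reusable across proteins of different sizes.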

Authors (7)
  1. Lirong Wu (67 papers)
  2. Yijun Tian (29 papers)
  3. Yufei Huang (81 papers)
  4. Siyuan Li (140 papers)
  5. Haitao Lin (63 papers)
  6. Stan Z. Li (222 papers)
  7. Nitesh V Chawla (13 papers)