
NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks (2403.14736v2)

Published 21 Mar 2024 in q-bio.QM, cs.AI, and cs.LG

Abstract: Protein classification tasks are essential in drug discovery. Real-world protein structures are dynamic, and these dynamics determine protein properties. However, existing machine learning methods, such as ProNet (Wang et al., 2022a), access only limited conformational characteristics and protein side-chain features, leading to impractical protein structures and inaccurate class predictions. In this paper, we propose two novel semantic data augmentation methods, Novel Augmentation of New Node Attributes (NaNa) and Molecular Interactions and Geometric Upgrading (MiGu), which incorporate backbone chemical and side-chain biophysical information into protein classification tasks, together with a co-embedding residual learning framework. Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate protein classification. Furthermore, our semantic augmentation methods and the co-embedding residual learning framework improve the performance of GIN (Xu et al., 2019) on the EC and Fold datasets (Bairoch, 2000; Andreeva et al., 2007) by 16.41% and 11.33%, respectively. Our code is available at https://github.com/r08b46009/Code_for_MIGU_NANA/tree/main.
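
As a rough illustration of the idea in the abstract, the sketch below appends extra per-residue attributes to node features and then applies one GIN-style aggregation step, h_v' = MLP((1 + eps) * h_v + sum of neighbor features), following Xu et al. (2019). It is not the authors' implementation: the function names, the toy three-residue graph, the one-hot secondary-structure attribute, and the single linear-plus-ReLU layer standing in for the MLP are all assumptions made for illustration, and the co-embedding residual framework is not modeled.

```python
# Hypothetical sketch only; names and data are illustrative assumptions,
# not taken from the paper's code.
from typing import List

Vector = List[float]


def concat_features(base: List[Vector], extra: List[Vector]) -> List[Vector]:
    """Append extra per-residue attributes to each node's base feature vector."""
    if len(base) != len(extra):
        raise ValueError("base and extra must describe the same number of nodes")
    return [b + e for b, e in zip(base, extra)]


def gin_aggregate(features: List[Vector], neighbors: List[List[int]],
                  eps: float = 0.0) -> List[Vector]:
    """GIN pre-MLP step: (1 + eps) * h_v plus the sum of neighbor features."""
    out = []
    for v, h_v in enumerate(features):
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            for i, x in enumerate(features[u]):
                agg[i] += x
        out.append(agg)
    return out


def linear_relu(features: List[Vector], weights: List[Vector]) -> List[Vector]:
    """One-layer stand-in for the GIN MLP: ReLU(h @ W)."""
    n_out = len(weights[0])
    return [
        [max(0.0, sum(h[i] * weights[i][j] for i in range(len(h))))
         for j in range(n_out)]
        for h in features
    ]


# Toy chain of three residues: 0 - 1 - 2.
neighbors = [[1], [0, 2], [1]]
base = [[1.0], [2.0], [3.0]]                      # placeholder base feature
extra = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # assumed one-hot: helix / sheet

augmented = concat_features(base, extra)
aggregated = gin_aggregate(augmented, neighbors)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
updated = linear_relu(aggregated, identity)
print(updated)
```

With identity weights the layer output equals the aggregated features, which makes it easy to see how the appended attributes propagate between neighboring residues.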

References (43)
  1. Automatic semantic augmentation of language model prompts (for code summarization). In International Conference on Software Engineering (ICSE), 2024.
  2. Data growth and its impact on the SCOP database: new developments. Nucleic acids research, 2007.
  3. Bairoch, A. The enzyme database in 2000. Nucleic acids research, 2000.
  4. Hydrogen bonding in globular proteins. Progress in biophysics and molecular biology, 1984.
  5. GraphQA: protein model quality assessment using graph convolutional networks. Bioinformatics, 2021.
  6. Bashford, D. Macroscopic electrostatic models for protonation states in proteins. Frontiers in Bioscience, 2004.
  7. The Amber biomolecular simulation programs. Journal of computational chemistry, 2005.
  8. Structure-aware protein self-supervised learning. Bioinformatics, 2023.
  9. Viral capsid proteins are segregated in structural fold space. PLoS computational biology, 2013.
  10. Copeland, R. A. Enzymes: a practical introduction to structure, mechanism, and data analysis. John Wiley & Sons, 2023.
  11. HTMD: high-throughput molecular dynamics for molecular discovery. Journal of chemical theory and computation, 2016.
  12. Twenty years on: the impact of fragments on drug discovery. Nature reviews Drug discovery, 2016.
  13. Importance of solvent accessibility and contact surfaces in modeling side-chain conformations in proteins. Journal of computational chemistry, 2004a.
  14. Importance of solvent accessibility and contact surfaces in modeling side-chain conformations in proteins. Journal of computational chemistry, 2004b.
  15. Knowledge‐based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 1995.
  16. Neural message passing for quantum chemistry. In International conference on machine learning (ICML), 2017.
  17. Structure-based protein function prediction using graph convolutional networks. Nature communications, 2021.
  18. Inductive representation learning on large graphs. Advances in neural information processing systems (NeurIPS), 2017.
  19. Subdivision of C4-pathway species based on differing C4 acid decarboxylating systems and ultrastructural features. Functional Plant Biology, 1975.
  20. Hydrogen bonds in proteins: role and strength. eLS, 2010.
  21. Learning from protein structure with geometric vector perceptrons. International Conference on Learning Representations (ICLR), 2020.
  22. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules, 1983.
  23. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
  24. Advantages of proteins being disordered. Protein Science, 2014.
  25. Protein secondary structure assignment revisited: a detailed analysis of different assignment methods. BMC structural biology, 2005.
  26. Hot spots—a review of the protein–protein interface determinant amino-acid residues. Proteins: Structure, Function, and Bioinformatics, 2007.
  27. Protonation and pK changes in protein–ligand binding. Quarterly reviews of biophysics, 2013.
  28. Pyle, A. Metal ions in the structure and function of RNA. JBIC Journal of Biological Inorganic Chemistry, 2002.
  29. SchNet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 2018.
  30. Classification of chemical bonds based on topological analysis of electron localization functions. Nature, 1994.
  31. Singh, R. A review of algorithmic techniques for disulfide-bond determination. Briefings in Functional Genomics and Proteomics, 2008.
  32. Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pKa values. Journal of chemical theory and computation, 2011.
  33. The nature and applications of π–π interactions: a perspective. Crystal Growth & Design, 2019.
  34. Effective data augmentation with diffusion models. 2023.
  35. Current and prospective applications of metal ion–protein binding. Journal of chromatography A, 2003.
  36. Learning hierarchical protein representations via complete 3D graph networks. In The Eleventh International Conference on Learning Representations, 2022a.
  37. ComENet: Towards complete and efficient message passing for 3D molecular graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2022b.
  38. Implicit semantic data augmentation for deep networks. Advances in Neural Information Processing Systems (NeurIPS), 2019.
  39. Whitford, D. Proteins: structure and function. John Wiley & Sons, 2013.
  40. How powerful are graph neural networks? International Conference on Learning Representations (ICLR), 2019.
  41. Where metal ions bind in proteins. Proceedings of the National Academy of Sciences, 1990.
  42. GNNGO3D: Protein function prediction based on 3D structure and functional hierarchy learning. IEEE Transactions on Knowledge and Data Engineering, 2023.
  43. Electrostatic interactions in protein structure, folding, binding, and condensation. Chemical reviews, 2018.
