Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates (2405.08205v3)
Abstract: Enzymes are genetically encoded biocatalysts that accelerate chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach that learns a single unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and the residues' three-dimensional (3D) coordinates, conditioned on functionally important sites and on substrates corresponding to the desired catalytic function; these sites are mined automatically from enzyme databases. EnzyGen consists of a novel interleaving network of attention and neighborhood equivariant layers, which captures both long-range correlations across the entire protein sequence and local influences from the nearest amino acids in 3D space. To learn the generative model, we devise a joint training objective comprising a sequence generation loss, a position prediction loss, and an enzyme-substrate interaction loss. We further construct EnzyBench, a dataset of 3157 enzyme families covering all enzymes available in the Protein Data Bank (PDB). Experimental results show that EnzyGen consistently achieves the best performance across all 323 test families, surpassing the best baseline by 10.79% in substrate binding affinity. These findings demonstrate EnzyGen's ability to design well-folded, effective enzymes that bind specific substrates with high affinity.
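The abstract names two concrete mechanisms: an architecture that interleaves global attention layers with neighborhood equivariant layers, and a joint objective summing a sequence loss, a position loss, and an enzyme-substrate interaction loss. The following is a minimal PyTorch sketch of those two ideas, not the authors' implementation; the layer widths, the k-nearest-neighbor rule, the EGNN-style coordinate update, and the unweighted sum of loss terms are illustrative assumptions.

```python
# Minimal sketch (illustrative, not EnzyGen's actual code) of interleaved
# attention + neighborhood-equivariant layers and a joint training loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeighborhoodEquivariantLayer(nn.Module):
    """EGNN-style update over the k nearest residues in 3D space (assumed k=16)."""
    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.coord_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 1))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, h, x):
        # h: (L, dim) residue features, x: (L, 3) residue coordinates; assumes L > k
        dist2 = torch.cdist(x, x).pow(2)                                  # (L, L) squared distances
        knn = dist2.topk(self.k + 1, largest=False).indices[:, 1:]        # drop self (distance 0)
        h_j, x_j = h[knn], x[knn]                                         # (L, k, dim), (L, k, 3)
        d_ij = dist2.gather(1, knn).unsqueeze(-1)                         # (L, k, 1)
        m_ij = self.edge_mlp(torch.cat([h.unsqueeze(1).expand_as(h_j), h_j, d_ij], dim=-1))
        # Coordinate update along relative directions keeps E(n) equivariance
        x = x + ((x.unsqueeze(1) - x_j) * self.coord_mlp(m_ij)).mean(dim=1)
        h = h + self.node_mlp(torch.cat([h, m_ij.sum(dim=1)], dim=-1))
        return h, x


class InterleavedBlock(nn.Module):
    """One global attention layer (long-range) followed by one local equivariant layer."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.local = NeighborhoodEquivariantLayer(dim)

    def forward(self, h, x):
        a, _ = self.attn(h.unsqueeze(0), h.unsqueeze(0), h.unsqueeze(0))
        h = self.norm(h + a.squeeze(0))
        return self.local(h, x)


def joint_loss(seq_logits, seq_target, coord_pred, coord_target, affinity_pred, affinity_target):
    """Sequence generation + position prediction + interaction terms (unweighted sum assumed)."""
    l_seq = F.cross_entropy(seq_logits, seq_target)       # per-residue amino acid prediction
    l_pos = F.mse_loss(coord_pred, coord_target)          # 3D coordinate regression
    l_int = F.mse_loss(affinity_pred, affinity_target)    # placeholder enzyme-substrate term
    return l_seq + l_pos + l_int
```

A full model would presumably stack several InterleavedBlock modules over per-residue embeddings, condition on the mined functionally important sites by keeping those residues fixed, and decode the remaining sequence and coordinates; those details are not specified in the abstract and are left out of this sketch.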