SurfPro: Functional Protein Design Based on Continuous Surface (2405.06693v2)
Abstract: How can we design proteins with desired functions? We are motivated by the chemical intuition that both geometric structure and biochemical properties are critical to a protein's function. In this paper, we propose SurfPro, a new method that generates functional proteins given a desired surface and its associated biochemical properties. SurfPro comprises a hierarchical encoder that progressively models the geometric shape and biochemical features of a protein surface, and an autoregressive decoder that produces an amino acid sequence. We evaluate SurfPro on the standard inverse folding benchmark CATH 4.2 and on two functional protein design tasks: protein binder design and enzyme design. SurfPro consistently surpasses previous state-of-the-art inverse folding methods, achieving a recovery rate of 57.78% on CATH 4.2 and higher success rates as measured by protein-protein binding and enzyme-substrate interaction scores.
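The recovery rate reported for CATH 4.2 is, in the usual inverse folding setting, the fraction of positions at which the designed sequence reproduces the native residue. The sketch below illustrates that per-sequence computation; the function name `recovery_rate` and the example sequences are hypothetical, and how the paper aggregates scores across the benchmark (for example mean versus median over proteins) is not stated in the abstract, so it is not modeled here.

```python
def recovery_rate(designed: str, native: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one.

    Assumes both sequences have the same length, as in inverse folding,
    where one residue is predicted per native position.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Hypothetical toy sequences: 8 of 10 positions agree.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
print(f"{recovery_rate(designed, native):.2%}")  # prints 80.00%
```

A benchmark-level figure such as 57.78% would come from applying a computation of this kind to every test protein and then aggregating the per-protein values.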