Out of Many, One: Designing and Scaffolding Proteins at the Scale of the Structural Universe with Genie 2 (2405.15489v1)
Abstract: Protein diffusion models have emerged as a promising approach for protein design. One such pioneering model is Genie, a method that asymmetrically represents protein structures during the forward and backward processes, using simple Gaussian noising for the former and expressive SE(3)-equivariant attention for the latter. In this work we introduce Genie 2, extending Genie to capture a larger and more diverse protein structure space through architectural innovations and massive data augmentation. Genie 2 adds motif scaffolding capabilities via a novel multi-motif framework that designs co-occurring motifs with unspecified inter-motif positions and orientations. This makes possible complex protein designs that engage multiple interaction partners and perform multiple functions. On both unconditional and conditional generation, Genie 2 achieves state-of-the-art performance, outperforming all known methods on key design metrics including designability, diversity, and novelty. Genie 2 also solves more motif scaffolding problems than other methods and does so with more unique and varied solutions. Taken together, these advances set a new standard for structure-based protein design. Genie 2 inference and training code, as well as model weights, are freely available at: https://github.com/aqlaboratory/genie2.
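As a rough illustration of the "simple Gaussian noising" the abstract attributes to the forward process, the sketch below applies the standard DDPM closed-form marginal, x_t ~ N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I), to a list of 3D coordinates. The abstract does not specify the noise schedule, number of steps, or coordinate representation, so the linear schedule, the 1000 steps, the toy three-point backbone, and all function names here are assumptions made for illustration. This is not the Genie 2 implementation.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule (an assumption, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_coordinates(coords, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).

    coords: list of (x, y, z) tuples, e.g. one point per residue.
    t: zero-based timestep index into alpha_bars.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in point)
        for point in coords
    ]


# Toy usage: three points spaced 3.8 units apart along the x-axis.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
toy_backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
noised = noise_coordinates(toy_backbone, t=999, alpha_bars=alpha_bars, rng=rng)
```

At the final timestep alpha_bar_t is close to zero under this assumed schedule, so the noised coordinates are nearly pure Gaussian noise. The reverse (denoising) process, which the abstract describes as using SE(3)-equivariant attention, is not sketched here.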