Out of Many, One: Designing and Scaffolding Proteins at the Scale of the Structural Universe with Genie 2 (2405.15489v1)

Published 24 May 2024 in q-bio.BM and cs.LG

Abstract: Protein diffusion models have emerged as a promising approach for protein design. One such pioneering model is Genie, a method that asymmetrically represents protein structures during the forward and backward processes, using simple Gaussian noising for the former and expressive SE(3)-equivariant attention for the latter. In this work we introduce Genie 2, extending Genie to capture a larger and more diverse protein structure space through architectural innovations and massive data augmentation. Genie 2 adds motif scaffolding capabilities via a novel multi-motif framework that designs co-occurring motifs with unspecified inter-motif positions and orientations. This makes possible complex protein designs that engage multiple interaction partners and perform multiple functions. On both unconditional and conditional generation, Genie 2 achieves state-of-the-art performance, outperforming all known methods on key design metrics including designability, diversity, and novelty. Genie 2 also solves more motif scaffolding problems than other methods and does so with more unique and varied solutions. Taken together, these advances set a new standard for structure-based protein design. Genie 2 inference and training code, as well as model weights, are freely available at: https://github.com/aqlaboratory/genie2.

Summary

  • The paper introduces Genie 2's novel multi-motif scaffolding approach, achieving superior design diversity and novelty compared to previous models.
  • The paper demonstrates enhanced protein structure generation through extensive data augmentation with AlphaFold predictions, expanding structural coverage.
  • The paper validates Genie 2's performance in both unconditional generation and motif scaffolding tasks, where it outperforms baseline methods in designability and yields more unique solutions.

Analysis of Genie 2: Advancements in Structure-based Protein Design

Introduction

Protein design is being reshaped by generative methods such as diffusion models and flow matching, building on the advances in structure prediction brought by AlphaFold 2. One notable diffusion model is Genie, which represents protein structures asymmetrically: simple Gaussian noising in the forward process and SE(3)-equivariant attention in the backward process. The paper presents Genie 2, which extends Genie to a larger and more diverse protein structure space through architectural changes and large-scale data augmentation.

Innovations in Genie 2

Genie 2 introduces several enhancements over its predecessor. Central among them is a multi-motif framework that adds motif scaffolding capabilities. It supports the design of proteins containing several co-occurring motifs, for example to engage multiple interaction partners, without specifying the positions and orientations of the motifs relative to one another. Genie 2 covers both unconditional and conditional (motif-conditioned) generation, and the paper reports state-of-the-art results on key design metrics in both settings.
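One way to picture "unspecified inter-motif positions and orientations" is as a pairwise conditioning mask in which geometry is fixed only between residues of the same motif. The sketch below illustrates that idea under this assumption; it is not taken from the Genie 2 code, and the function name `motif_pair_mask`, the motif labels, and the residue indices are all hypothetical.

```python
from typing import Dict, List


def motif_pair_mask(length: int, motifs: Dict[str, List[int]]) -> List[List[bool]]:
    """Return a length x length mask that is True only for same-motif residue pairs.

    `motifs` maps a (hypothetical) motif label to the scaffold positions it occupies.
    Pairs drawn from different motifs stay False, so their relative geometry
    is left unconstrained.
    """
    mask = [[False] * length for _ in range(length)]
    for positions in motifs.values():
        for i in positions:
            for j in positions:
                mask[i][j] = True
    return mask


# Toy example: an 8-residue design with two 2-residue motifs.
mask = motif_pair_mask(8, {"A": [1, 2], "B": [5, 6]})
```

Here `mask[1][2]` is set because both residues belong to motif A, while `mask[1][5]` is not, since it spans motifs A and B.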

Architectural Enhancements

The original Genie represents protein structures asymmetrically, using Gaussian noising in the forward process and SE(3)-equivariant attention in the backward process. Genie 2 extends this framework with a multi-motif scaffolding approach that handles motifs whose relative positions and orientations are not defined in advance. This flexibility makes it possible to design proteins that carry out multiple functions or interactions.
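The forward Gaussian noising step can be sketched with the standard DDPM closed form, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε with ε drawn from a standard normal. The example below is a minimal illustration, assuming that noise is applied independently to each coordinate of a backbone trace and that a linear noise schedule is used; the schedule values, step count, and function names are assumptions rather than Genie 2's actual settings.

```python
import math
import random
from typing import List, Tuple

Coord = Tuple[float, float, float]


def alpha_bar(t: int, betas: List[float]) -> float:
    """Cumulative product of (1 - beta_s) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noise_coords(x0: List[Coord], t: int, betas: List[float],
                 rng: random.Random) -> List[Coord]:
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bar(t, betas)
    scale, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [tuple(scale * c + sigma * rng.gauss(0.0, 1.0) for c in atom)
            for atom in x0]


# Assumed linear schedule; the real schedule and number of steps may differ.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
x0 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]  # toy backbone trace
xt = noise_coords(x0, t=500, betas=betas, rng=random.Random(0))
```

At `t = 0` the coordinates are returned unchanged, and as `t` approaches `T` the sample approaches pure Gaussian noise, which is the state the backward process starts from.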

Data Augmentation Strategies

Because the Protein Data Bank (PDB) covers only part of protein structure space, Genie 2 augments its training data with confidently predicted structures from the AlphaFold database (AFDB). The larger training set lets the model capture a broader structural space, which is instrumental in achieving higher designability, diversity, and novelty.
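Selecting "confidently predicted" structures is typically done by filtering on AlphaFold's per-residue confidence score, pLDDT. The following sketch shows one such filter, assuming per-residue pLDDT values are already loaded into memory; the cutoff of 80 on the mean pLDDT, the function name, and the entry identifiers are illustrative assumptions, not values stated in this summary.

```python
from statistics import mean
from typing import Dict, List


def confident_structures(plddt_by_id: Dict[str, List[float]],
                         min_mean_plddt: float = 80.0) -> List[str]:
    """Return the IDs whose mean per-residue pLDDT meets the (assumed) cutoff.

    Entries with no residues are skipped rather than treated as confident.
    """
    return sorted(
        entry_id
        for entry_id, scores in plddt_by_id.items()
        if scores and mean(scores) >= min_mean_plddt
    )


# Hypothetical entries: one confident, one not, one empty.
demo = {"AF-1": [90.0, 85.0], "AF-2": [60.0, 70.0], "AF-3": []}
kept = confident_structures(demo)
```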

Performance Evaluation

Genie 2 was compared against leading protein design models, including Chroma, FrameFlow, and RFdiffusion, on designability, diversity, and novelty, as well as on motif scaffolding tasks.

Unconditional Protein Generation

In unconditional generation, Genie 2 achieved designability comparable to RFdiffusion with significantly higher diversity and novelty. The generated structures covered a wide range of secondary structure elements, though with some bias towards helices, likely reflecting the composition of the training set. Genie 2 outperformed competing models especially at short sequence lengths, where the design space is smaller.
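To make the metrics concrete, the sketch below computes designability and a cluster count from precomputed scores. It assumes a common convention in this literature: a structure counts as designable when its self-consistency RMSD is at most 2 Å, and diversity is judged by how many structural clusters the samples form. The cutoffs, the single-linkage clustering rule, and the function names are assumptions and may differ from the paper's exact protocol.

```python
from typing import List


def designability(sc_rmsd: List[float], cutoff: float = 2.0) -> float:
    """Fraction of samples whose self-consistency RMSD is within the cutoff."""
    return sum(r <= cutoff for r in sc_rmsd) / len(sc_rmsd)


def count_clusters(tm: List[List[float]], threshold: float = 0.6) -> int:
    """Count single-linkage clusters, linking i and j when tm[i][j] >= threshold."""
    parent = list(range(len(tm)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tm)):
        for j in range(i + 1, len(tm)):
            if tm[i][j] >= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(tm))})


# Toy pairwise TM-scores: samples 0 and 1 are similar, sample 2 is distinct.
tm_scores = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.2],
    [0.3, 0.2, 1.0],
]
n_clusters = count_clusters(tm_scores)
frac_designable = designability([1.0, 1.5, 3.0, 2.5])
```

Dividing the cluster count by the number of generated samples gives a diversity score between 0 and 1, with higher values meaning fewer near-duplicate structures.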

Single and Multi-Motif Scaffolding

Genie 2 performed strongly in single-motif scaffolding, outperforming RFdiffusion on a benchmark of 24 design tasks. It produced more unique solutions, and the gap widened as the number of samples increased, which suggests that Genie 2 captures a more diverse protein structure space. Genie 2 was also evaluated on six multi-motif scaffolding tasks and solved four of them, showing that it can handle design problems involving multiple functional motifs.

Implications and Future Directions

The advancements embodied by Genie 2 have significant practical and theoretical implications. From a practical perspective, the model's robust performance in designability, diversity, and novelty makes it a potent tool for therapeutic and industrial applications, such as developing new enzymes, biosensors, and multi-functional proteins. Theoretically, the ability to scaffold multiple motifs without specifying inter-motif geometry suggests new avenues in protein architecture design and function prediction.

Future work could explore further integration of sequence-based information into the structural design process, enabling a more seamless sequence-structure-function relationship. Additionally, improvements in training datasets, incorporating more diverse and experimentally validated structures, could enhance the robustness and applicability of such models. Moreover, expanding the capabilities to include protein-protein interaction modeling could provide comprehensive solutions for designing complex macromolecular assemblies.

Conclusion

Genie 2 is a significant advance in generative protein design. By combining architectural innovations with extensive data augmentation, it sets a new benchmark for structure-based protein design. Its results on designability, diversity, novelty, and motif scaffolding make it a versatile tool for both the study and the application of protein design.