SE(3)-Stochastic Flow Matching for Protein Backbone Generation (2310.02391v4)

Published 3 Oct 2023 in cs.LG and cs.AI

Abstract: The computational design of novel protein structures has the potential to impact numerous scientific disciplines greatly. Toward this goal, we introduce FoldFlow, a series of novel generative models of increasing modeling power based on the flow-matching paradigm over $3\mathrm{D}$ rigid motions -- i.e. the group $\text{SE}(3)$ -- enabling accurate modeling of protein backbones. We first introduce FoldFlow-Base, a simulation-free approach to learning deterministic continuous-time dynamics and matching invariant target distributions on $\text{SE}(3)$. We next accelerate training by incorporating Riemannian optimal transport to create FoldFlow-OT, leading to the construction of both more simple and stable flows. Finally, we design FoldFlow-SFM, coupling both Riemannian OT and simulation-free training to learn stochastic continuous-time dynamics over $\text{SE}(3)$. Our family of FoldFlow generative models offers several key advantages over previous approaches to the generative modeling of proteins: they are more stable and faster to train than diffusion-based approaches, and our models enjoy the ability to map any invariant source distribution to any invariant target distribution over $\text{SE}(3)$. Empirically, we validate FoldFlow on protein backbone generation of up to $300$ amino acids, leading to high-quality designable, diverse, and novel samples.

Authors (10)
  1. Avishek Joey Bose
  2. Tara Akhound-Sadegh
  3. Kilian Fatras
  4. Guillaume Huguet
  5. Jarrid Rector-Brooks
  6. Cheng-Hao Liu
  7. Andrei Cristian Nica
  8. Maksym Korablyov
  9. Michael Bronstein
  10. Alexander Tong
Citations (47)

Summary

  • The paper introduces the novel FoldFlow framework that combines deterministic continuous-time dynamics, simulation-free stochastic training, and Riemannian optimal transport for protein backbone generation.
  • It demonstrates that FoldFlow models can generate protein backbones of up to 300 amino acids more efficiently and stably than traditional diffusion-based methods.
  • The work offers significant implications for drug design and protein engineering by enabling rapid generation of designable, diverse, and novel protein structures.

Overview of SE(3) Stochastic Flow Matching for Protein Backbone Generation

The paper "SE(3)SE(3) Stochastic Flow Matching for Protein Backbone Generation" presents novel methodologies for the computational design of protein structures, a task that holds significant promise across scientific domains, including drug design and therapeutic development. The work introduces the FoldFlow model family, leveraging the structural group SE(3)SE(3) to accurately generate protein backbones. The authors target three primary innovations: deterministic continuous-time dynamics, Riemannian optimal transport (OT), and simulation-free stochastic training. These adaptations collectively advance the generative modeling landscape beyond standard diffusion-based approaches, which the authors argue are less stable and slower to converge.

FoldFlow comprises three distinct models: FoldFlow-Base, FoldFlow-OT, and FoldFlow-SFM, each progressively more sophisticated in capturing the geometry of SE(3). The foundational model, FoldFlow-Base, employs simulation-free training to learn deterministic continuous-time dynamics. FoldFlow-OT augments this with Riemannian optimal transport, which simplifies and stabilizes the generative flows (a sketch of the OT pairing step follows below). The most comprehensive model, FoldFlow-SFM, combines both prior innovations while replacing the deterministic dynamics with stochastic continuous-time dynamics.
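As a rough illustration of the coupling step behind FoldFlow-OT, the sketch below pairs minibatch source and target samples through an exact optimal transport plan, computed here with the POT library. This is not the authors' code; dist_fn (a user-supplied geodesic distance on SE(3)) and ot_pairs are hypothetical names:

```python
# Minimal sketch of minibatch OT pairing: couple source and target samples
# with an exact transport plan so the conditional flows between paired
# samples are shorter and straighter. Not the FoldFlow implementation.
import numpy as np
import ot  # POT: Python Optimal Transport


def ot_pairs(x0, x1, dist_fn, rng=None):
    rng = rng or np.random.default_rng()
    n = len(x0)
    # Pairwise squared-distance cost matrix between the two minibatches.
    M = np.array([[dist_fn(a, b) ** 2 for b in x1] for a in x0])
    w = np.full(n, 1.0 / n)  # uniform marginals over the minibatch
    plan = ot.emd(w, w, M)   # exact (non-entropic) optimal transport plan
    # Draw index pairs with probability proportional to the plan's mass.
    flat = rng.choice(n * n, size=n, p=plan.ravel() / plan.sum())
    return flat // n, flat % n  # paired indices into x0 and x1
```

Training then regresses the vector field along geodesics between OT-paired samples rather than arbitrary ones, which is what makes the resulting flows simpler and more stable.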

Strong Numerical Results and Claims

The authors present empirical validation showing FoldFlow's capability to generate protein backbones of up to 300 amino acids efficiently. The results highlight the designability, diversity, and novelty of the generated samples, with metrics against standard baselines showing the FoldFlow models outperforming diffusion-based approaches on these criteria. The development of Conditional Flow Matching (CFM) methods, which allow mapping from any invariant source distribution to any invariant target distribution on SE(3), is noted as a key feature enabling faster and more stable training. The stochastic dynamics of FoldFlow-SFM are further emphasized as a way to accommodate the variability inherent in protein structures.
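For intuition about what the conditional paths look like on rotations, here is a small illustrative sketch, using SciPy rather than anything from the paper, of the SO(3) geodesic interpolant and the velocity a flow-matching model would be regressed against at time t:

```python
# Illustrative only: geodesic interpolation on SO(3) and the corresponding
# flow-matching regression target, built from SciPy's rotation utilities.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

r0 = Rotation.random()  # source frame (e.g. drawn from the invariant prior)
r1 = Rotation.random()  # target frame (e.g. a residue frame from the data)

t = 0.3
path = Slerp([0.0, 1.0], Rotation.concatenate([r0, r1]))
xt = path([t])[0]  # point on the geodesic from r0 to r1 at time t

# Left-trivialized velocity toward r1: the log map of the relative rotation,
# rescaled by the remaining time, expressed as a vector in the Lie algebra.
target = (xt.inv() * r1).as_rotvec() / (1.0 - t)
```

Per-residue translations admit the same construction with straight-line interpolation in $\mathbb{R}^3$.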

Implications and Future Directions

The practical implications span computational biology and protein engineering, offering a robust toolset for rational protein design that complements experimental approaches such as directed evolution. Theoretically, the FoldFlow models provide structured generative architectures that may inspire future research in manifold-based deep learning, stochastic processes, and geometric machine learning.

Given these advancements, pathways for future exploration include extending the models to conditional generation, enabling more targeted protein engineering tasks. Integrating sequence-level data with structure could push the models into richer design spaces that align with biological function. Scaling the models to larger protein complexes and detailed molecular interactions could further broaden their utility.

In conclusion, FoldFlow represents a meaningful advance in protein structure modeling, aligning computational capability with biological insight. These contributions open up possibilities in pharmacology, biotechnology, and synthetic biology by supporting informed rational design strategies.
