
Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion (2401.06151v1)

Published 21 Dec 2023 in q-bio.BM, cs.AI, cs.LG, and q-bio.QM

Abstract: Generative models of macromolecules carry abundant and impactful implications for industrial and biomedical efforts in protein engineering. However, existing methods are currently limited to modeling protein structures or sequences, independently or jointly, without regard to the interactions that commonly occur between proteins and other macromolecules. In this work, we introduce MMDiff, a generative model that jointly designs sequences and structures of nucleic acid and protein complexes, independently or in complex, using joint SE(3)-discrete diffusion noise. Such a model has important implications for emerging areas of macromolecular design including structure-based transcription factor design and design of noncoding RNA sequences. We demonstrate the utility of MMDiff through a rigorous new design benchmark for macromolecular complex generation that we introduce in this work. Our results demonstrate that MMDiff is able to successfully generate micro-RNA and single-stranded DNA molecules while being modestly capable of joint modeling DNA and RNA molecules in interaction with multi-chain protein complexes. Source code: https://github.com/Profluent-Internships/MMDiff.


Summary

  • The paper introduces MMDiff, a framework that integrates SE(3)-discrete diffusion to jointly generate sequences and structures of nucleic acid and protein complexes.
  • The paper demonstrates that MMDiff achieves notable designability, diversity, and novelty by surpassing a random generative baseline in key structural metrics.
  • The paper highlights limitations in protein-nucleic acid complex generation, paving the way for future work with larger datasets and full-atom modeling improvements.

An Essay on Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes Using SE(3)-Discrete Diffusion

This paper introduces MMDiff, a generative model that jointly designs the sequences and structures of nucleic acid and protein complexes. Existing methods for macromolecular design typically model protein sequences and structures, separately or together, without accounting for interactions with other macromolecules such as nucleic acids. MMDiff addresses this gap with joint SE(3)-discrete diffusion: a continuous diffusion process for structure generation and a discrete diffusion process for sequence generation, combined in a shared framework.

Methodology and Technical Contributions

MMDiff combines diffusion over SE(3) with a discrete noise process over sequence. SE(3), the group of 3D rotations and translations, governs the structural modeling, while the sequence is represented as one-hot vectors embedded in a continuous space. This combination lets the model cover the structure and sequence design spaces simultaneously, a key advance over existing methods, which typically handle these spaces separately.
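To make the joint noising idea concrete, here is a minimal NumPy sketch of corrupting residue frames and a one-hot sequence at a shared noise level. It is an illustration under stated assumptions, not the MMDiff implementation: the function names, the square-root noise schedule, and the vocabulary size are hypothetical. FrameDiff-style models draw rotation noise from the isotropic Gaussian on SO(3); the sketch swaps in a simpler random axis-angle perturbation to stay short.

```python
import numpy as np

def random_rotation(rng, angle_std):
    """Rotation about a random axis by a Gaussian-distributed angle (Rodrigues formula)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.normal(scale=angle_std)
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def noise_complex(rotations, translations, residue_types, num_types, t, rng):
    """Jointly noise frames and sequence at noise level t in [0, 1].

    rotations: (N, 3, 3), translations: (N, 3), residue_types: (N,) ints.
    The schedule below is an assumption made for illustration only.
    """
    n = len(residue_types)
    # Structure, translation part: Gaussian noise blended in with weight sqrt(t).
    noisy_trans = (np.sqrt(1.0 - t) * translations
                   + np.sqrt(t) * rng.normal(size=(n, 3)))
    # Structure, rotation part: compose each frame with a random rotation
    # whose typical angle grows with t (simplified stand-in for IGSO(3) noise).
    noisy_rots = np.stack(
        [random_rotation(rng, angle_std=t * np.pi) @ R for R in rotations]
    )
    # Sequence: one-hot vectors treated as continuous and noised the same way.
    one_hot = np.eye(num_types)[residue_types]
    noisy_seq = (np.sqrt(1.0 - t) * one_hot
                 + np.sqrt(t) * rng.normal(size=one_hot.shape))
    return noisy_rots, noisy_trans, noisy_seq
```

At `t = 0` the inputs come back unchanged, and the rotations remain valid rotation matrices at every noise level, which is the property that keeps the structural part of the process on SE(3).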

Notably, the model architecture builds on FrameDiff, adapted here for macromolecular complexes so that MMDiff can generate both the protein and nucleic acid components of a complex. The authors also introduce an intra-molecule consensus sampling technique, which keeps generated sequences internally consistent by adjusting residue-type probabilities during sampling according to the consensus molecule type.
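One plausible reading of consensus sampling is sketched below: each chain votes on its molecule type by summing the predicted probability mass over that type's residue vocabulary, and residue types belonging to the losing molecule types are then masked out before sampling. The vocabulary layout (20 amino acids, 4 DNA bases, 4 RNA bases), the function name, and the voting rule are assumptions made for illustration; the paper's exact procedure may differ.

```python
import numpy as np

# Hypothetical vocabulary layout, assumed for this sketch only.
VOCAB_SLICES = {
    "protein": slice(0, 20),
    "dna": slice(20, 24),
    "rna": slice(24, 28),
}

def consensus_sample(probs, chain_ids, rng):
    """Sample residue types so that every chain uses a single molecule type.

    probs: (N, 28) per-residue probabilities, chain_ids: (N,) chain labels.
    Returns the sampled tokens and the molecule type chosen for each chain.
    """
    probs = np.asarray(probs, dtype=float)
    chain_ids = np.asarray(chain_ids)
    tokens = np.empty(len(probs), dtype=int)
    chain_types = {}
    for chain in np.unique(chain_ids):
        idx = np.flatnonzero(chain_ids == chain)
        # Consensus vote: total probability mass per molecule type in this chain.
        mass = {name: probs[idx, sl].sum() for name, sl in VOCAB_SLICES.items()}
        winner = max(mass, key=mass.get)
        chain_types[int(chain)] = winner
        # Keep only the winning type's residues; epsilon avoids an all-zero row.
        sl = VOCAB_SLICES[winner]
        masked = np.zeros_like(probs[idx])
        masked[:, sl] = probs[idx, sl] + 1e-9
        masked /= masked.sum(axis=1, keepdims=True)
        for row, p in zip(idx, masked):
            tokens[row] = rng.choice(len(p), p=p)
    return tokens, chain_types
```

With this rule, a residue whose own prediction favors a DNA base is still sampled as an amino acid when the rest of its chain is predicted to be protein, so no chain mixes alphabets.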

Experimental Evaluation and Results

The evaluation of MMDiff entails generating proteins, nucleic acids, and protein-nucleic acid complexes, then analyzing them for designability, diversity, and novelty. Designability is assessed by comparing each co-designed structure against the structure that RoseTTAFold2NA predicts from the generated sequence, using the self-consistency metrics scRMSD and scTM. Diversity is gauged by clustering the generated structures to measure variation among the output samples. Novelty measures how distinct the generated structures are from those in the training dataset.
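The scRMSD part of the designability check can be sketched as an RMSD after optimal rigid superposition (the Kabsch algorithm). This is a minimal sketch under assumptions: the coordinate arrays stand in for a generated backbone and the corresponding RoseTTAFold2NA prediction, which is not invoked here, and the function names and the choice of one representative atom per residue are hypothetical.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid alignment."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    aligned = Pc @ R.T
    return float(np.sqrt(((aligned - Qc) ** 2).sum(axis=1).mean()))

def designable_fraction(sc_rmsds, threshold=5.0):
    """Fraction of samples whose scRMSD (in angstroms) is below the threshold."""
    sc_rmsds = np.asarray(sc_rmsds, dtype=float)
    return float((sc_rmsds < threshold).mean())
```

In use, `kabsch_rmsd(generated_coords, predicted_coords)` would be computed once per sample, and `designable_fraction` would then summarize the share of samples under a chosen cutoff such as 5 Å.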

Empirical results show that MMDiff generates diverse and novel nucleic acid structures, with 8.67% of nucleic acid samples achieving an scRMSD below 5 Å, and that it outperforms a random generative baseline in every aspect evaluated. Protein-nucleic acid complexes proved more challenging, a setting where larger training datasets are expected to help. These results underline MMDiff's usefulness for producing diverse structures beyond the traditional single-molecule focus.

Limitations and Future Directions

While MMDiff shows promise in nucleic acid design, its performance on protein-nucleic acid complexes leaves room for improvement. The limited amount of high-quality training data available from databases such as the PDB restricts how well complex interactions can be modeled. Future work could expand the available training datasets, introduce full-atom modeling to capture subtle interactions more precisely, and validate generated designs experimentally in the laboratory.

The implications of this research are significant for fields such as drug design, synthetic biology, and biotechnology, where the ability to jointly design sequences and structures can expedite the development of new therapeutic interventions and biomaterials. As computational power increases and more comprehensive macromolecular datasets become available, models like MMDiff could evolve to generate increasingly sophisticated and biologically relevant designs.

In conclusion, MMDiff demonstrates that sequence and structure can be modeled jointly for nucleic acid and protein complexes. By coupling sequence and structure generation through SE(3)-discrete diffusion, it addresses a critical gap left by protein-only methods and sets the stage for future advances in integrative biomolecular design.