Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion (2401.06151v1)
Abstract: Generative models of macromolecules have broad implications for industrial and biomedical efforts in protein engineering. However, existing methods are limited to modeling protein structures or sequences, independently or jointly, without regard to the interactions that commonly occur between proteins and other macromolecules. In this work, we introduce MMDiff, a generative model that jointly designs sequences and structures of nucleic acid and protein complexes, independently or in complex, using joint SE(3)-discrete diffusion noise. Such a model has important implications for emerging areas of macromolecular design, including structure-based transcription factor design and the design of noncoding RNA sequences. We demonstrate the utility of MMDiff through a rigorous new design benchmark for macromolecular complex generation that we introduce in this work. Our results demonstrate that MMDiff is able to successfully generate micro-RNA and single-stranded DNA molecules while being modestly capable of jointly modeling DNA and RNA molecules in interaction with multi-chain protein complexes. Source code: https://github.com/Profluent-Internships/MMDiff.