Dirichlet Diffusion Score Model for Biological Sequence Generation (2305.10699v2)
Abstract: Designing biological sequences is an important challenge that requires satisfying complex constraints, and is thus a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. The score-based generative stochastic differential equation (SDE) model is a continuous-time diffusion framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined on the probability simplex whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as the Dirichlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that the designed sequences share similar properties with natural promoter sequences.
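To make the idea of a simplex-valued diffusion with a Dirichlet stationary distribution concrete, the following is a minimal sketch for the two-category case, where the simplex reduces to the interval [0, 1] and the Dirichlet distribution reduces to a Beta distribution. It assumes the standard Jacobi diffusion form dx = (s/2)[a(1 - x) - b x] dt + sqrt(s x (1 - x)) dw, whose stationary distribution is Beta(a, b). The abstract does not state the SDE, so the function name `simulate_jacobi`, the parameters `a`, `b`, `s`, the Euler-Maruyama discretization, and the clipping step are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def simulate_jacobi(x0, a, b, s=1.0, dt=1e-2, n_steps=400, seed=0):
    """Euler-Maruyama simulation of an assumed Jacobi diffusion on [0, 1].

    Hypothetical helper for illustration only. Under the assumed SDE the
    stationary distribution is Beta(a, b), i.e. a two-category Dirichlet.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # Drift pulls x toward the stationary mean a / (a + b).
        drift = 0.5 * s * (a * (1.0 - x) - b * x)
        # Noise scale vanishes at the boundaries 0 and 1.
        diffusion = np.sqrt(s * x * (1.0 - x))
        noise = rng.standard_normal(x.shape)
        x = x + drift * dt + diffusion * np.sqrt(dt) * noise
        # Discretization can overshoot the interval; clip back (an assumption).
        x = np.clip(x, 0.0, 1.0)
    return x

# Start every path at a one-hot vertex of the simplex (x = 1) and diffuse.
x0 = np.ones(5000)
samples = simulate_jacobi(x0, a=2.0, b=3.0)

# Each sample is a point (x, 1 - x) on the two-category probability simplex.
simplex_points = np.stack([samples, 1.0 - samples], axis=1)
print("empirical mean:", samples.mean(), "Beta(2, 3) mean:", 2.0 / 5.0)
```

Starting from a one-hot vertex mirrors how a discrete symbol can be embedded in the simplex before noising. After enough time, the samples forget their starting vertex and their empirical mean approaches a / (a + b), which is what lets a generative model start its reverse process from the known stationary distribution.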