scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling (2404.06153v1)
Abstract: Motivation: Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology widely used in biological research to examine gene expression at the level of individual cells within a tissue sample. Although numerous tools have been developed for scRNA-seq data analysis, it remains difficult to capture the distinctive features of such data and to generate virtual datasets with analogous statistical properties. Results: Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT), which generates virtual scRNA-seq data from a real dataset. The method is a neural network built on Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs). Gaussian noise is added to the real data through iterative noise-adding steps, and the model learns to reverse this process, ultimately denoising Gaussian noise into scRNA-seq samples. This scheme allows the model to learn data features from actual scRNA-seq samples during training. Our experiments, conducted on two distinct scRNA-seq datasets, demonstrate superior performance. Additionally, the sampling process is accelerated by incorporating Denoising Diffusion Implicit Models (DDIM). scRDiT offers a unified methodology that lets users train neural network models on their own scRNA-seq datasets and generate large numbers of high-quality scRNA-seq samples. Availability and implementation: https://github.com/DongShengze/scRDiT
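The noise-adding and accelerated-sampling steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the scRDiT implementation: the linear noise schedule, the 1000 training steps, the 50 sampling steps, and all function names are assumptions chosen for illustration, and the trained transformer is replaced by a stand-in that returns the true noise. The sketch shows the closed-form DDPM forward step and the deterministic DDIM update over a shortened timestep sequence.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule, as in the DDPM paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), used by the closed-form forward step."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, abar, rng):
    """Forward process: noise x0 directly to step t. Returns (x_t, noise used)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
    return xt, eps

def ddim_step(xt, eps_pred, t, t_prev, abar):
    """Deterministic DDIM update (eta = 0) from step t to an earlier step t_prev.

    t_prev = -1 denotes the fully denoised sample.
    """
    abar_prev = abar[t_prev] if t_prev >= 0 else 1.0
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (xt - np.sqrt(1.0 - abar[t]) * eps_pred) / np.sqrt(abar[t])
    # Re-noise that estimate to the (lower) noise level of t_prev.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced, decreasing subset of the training timesteps."""
    return np.linspace(num_train_steps - 1, 0, num_sample_steps).round().astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    abar = alpha_bars(linear_beta_schedule())
    x0 = rng.random((4, 16))            # toy stand-in for 4 cells x 16 genes
    x, eps = q_sample(x0, 999, abar, rng)
    steps = ddim_timesteps(1000, 50)    # 50 sampling steps instead of 1000
    for i, t in enumerate(steps):
        t_prev = steps[i + 1] if i + 1 < len(steps) else -1
        # A trained network would predict the noise here; the true noise is
        # used as an oracle so the sketch runs without a model.
        x = ddim_step(x, eps, t, t_prev, abar)
    print(np.allclose(x, x0))
```

Because the DDIM update is deterministic, skipping timesteps changes only how many network evaluations are needed, which is the source of the sampling speed-up; in practice the quality of the result depends on how well the trained network predicts the noise.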