Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data (2308.01839v2)

Published 3 Aug 2023 in q-bio.QM, cs.CV, q-bio.GN, stat.AP, and stat.ML

Abstract: Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.
