K-Nearest-Neighbors Induced Topological PCA for scRNA Sequence Data Analysis
Abstract: Single-cell RNA sequencing (scRNA-seq) is widely used to reveal cellular heterogeneity, providing insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because the data are sparse and involve a large number of genes. Dimensionality reduction and feature selection are therefore important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, cannot capture the geometrical structure embedded in the data, and previous graph Laplacian regularizations are limited to the analysis of a single scale. We propose a topological Principal Component Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L$_{2,1}$-norm regularization to address multiscale and multiclass heterogeneity in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses many limitations of traditional persistent homology. Rather than inducing a filtration by varying a distance threshold, we introduce kNN-tPCA, in which the filtration is obtained by varying the number of neighbors in a kNN network at each step, and we find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of the proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets and show that they outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF), by significant margins.
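To make the kNN-induced filtration concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It builds a nested family of kNN graphs by increasing the number of neighbors k, forms the ordinary graph Laplacian of each, and accumulates them with weights. The function names, the union symmetrization of the kNN relation, and the weighted-sum accumulation rule are all assumptions made for illustration; the abstract does not specify them, and a full persistent Laplacian involves more than this sum. The `l21_norm` helper assumes the common convention that the L$_{2,1}$ norm is the sum of the Euclidean norms of the rows.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def knn_adjacency(dist, k):
    """Symmetric 0/1 adjacency: i and j are linked if either is among
    the other's k nearest neighbors (union symmetrization, an assumption)."""
    n = len(dist)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Stable sort, so the first k neighbors are a subset of the first k+1.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in order[:k]:
            adj[i][j] = 1
            adj[j][i] = 1
    return adj


def graph_laplacian(adj):
    """Combinatorial Laplacian L = D - A."""
    n = len(adj)
    return [[sum(adj[i]) if i == j else -adj[i][j] for j in range(n)]
            for i in range(n)]


def knn_filtration_laplacian(points, ks, weights):
    """Weighted sum of kNN-graph Laplacians over increasing k.
    The accumulation rule is hypothetical, chosen only for illustration."""
    dist = pairwise_distances(points)
    n = len(points)
    total = [[0.0] * n for _ in range(n)]
    for k, w in zip(ks, weights):
        lap = graph_laplacian(knn_adjacency(dist, k))
        for i in range(n):
            for j in range(n):
                total[i][j] += w * lap[i][j]
    return total


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows of a matrix."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)
```

In this sketch the list `ks` plays the role that a list of distance thresholds would play in a distance-based filtration. Because k is an integer bounded by the number of samples, the set of candidate scales is finite and independent of the units of the data.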