Capturing the Denoising Effect of PCA via Compression Ratio (2204.10888v2)
Abstract: Principal component analysis (PCA) is one of the most fundamental tools in machine learning, widely used for both dimensionality reduction and denoising. In the latter setting, PCA is known to be effective at subspace recovery and is proven to aid clustering algorithms in some specific settings, but its improvement of noisy data is still not well quantified in general. In this paper, we propose a novel metric called compression ratio to capture the effect of PCA on high-dimensional noisy data. We show that, for data with underlying community structure, PCA significantly reduces the distance between data points belonging to the same community while reducing inter-community distance relatively mildly. We explain this phenomenon through both theoretical proofs and experiments on real-world data. Building on this new metric, we design a straightforward algorithm that can be used to detect outliers. Roughly speaking, we argue that points that have a lower variance of compression ratio do not share a common signal with others (and hence could be considered outliers). We provide theoretical justification for this simple outlier detection algorithm and use simulations to demonstrate that our method is competitive with popular outlier detection tools. Finally, we run experiments on real-world high-dimensional noisy data (single-cell RNA-seq) to show that removing points from these datasets via our outlier detection method improves the accuracy of clustering algorithms. Our method is very competitive with popular outlier detection tools in this task.
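To make the idea concrete, the following is a minimal sketch of the quantities the abstract describes, not the paper's reference implementation. It assumes that the compression ratio of a pair of points is the pre-PCA Euclidean distance divided by the post-PCA distance (the abstract does not spell out the formula), and that outliers are flagged by taking the points whose compression ratios to all other points have the lowest variance. The function names, the choice of `k`, and the fixed outlier count `n_outliers` are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # centering leaves pairwise distances unchanged
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # rows of Vt are orthonormal principal directions


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding


def compression_ratios(X, k):
    """ASSUMED definition: pre-PCA distance / post-PCA distance, per pair."""
    pre = pairwise_distances(X)
    post = pairwise_distances(pca_project(X, k))
    n = X.shape[0]
    ratios = np.full((n, n), np.nan)  # diagonal and degenerate pairs stay NaN
    valid = ~np.eye(n, dtype=bool) & (post > 0)
    ratios[valid] = pre[valid] / post[valid]
    return ratios


def variance_scores(X, k):
    """Per-point variance of its compression ratios to all other points."""
    return np.nanvar(compression_ratios(X, k), axis=1)


def flag_outliers(X, k, n_outliers):
    """Hypothetical rule: return indices of the n_outliers lowest-variance points."""
    return np.argsort(variance_scores(X, k))[:n_outliers]


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): two communities with
    # well-separated centers in 50 dimensions, plus isotropic Gaussian noise.
    rng = np.random.default_rng(0)
    center = np.zeros(50)
    center[0] = 10.0
    X = np.vstack([
        center + rng.normal(size=(30, 50)),
        -center + rng.normal(size=(30, 50)),
    ])
    print(variance_scores(X, k=1)[:5])
```

Because the projection is orthogonal, post-PCA distances never exceed pre-PCA distances, so every ratio under this assumed definition is at least 1. Pairs from the same community lose most of their (noise-dominated) distance and get large ratios, while pairs from different communities keep the distance between their centers and get ratios close to 1.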