Quiver Laplacians and Feature Selection (2404.06993v1)

Published 10 Apr 2024 in stat.ML, cs.LG, math.CO, math.RT, math.ST, q-bio.QM, and stat.TH

Abstract: The challenge of selecting the most relevant features of a given dataset arises ubiquitously in data analysis and dimensionality reduction. However, features found to be of high importance for the entire dataset may not be relevant to subsets of interest, and vice versa. Given a feature selector and a fixed decomposition of the data into subsets, we describe a method for identifying selected features which are compatible with the decomposition into subsets. We achieve this by re-framing the problem of finding compatible features to one of finding sections of a suitable quiver representation. In order to approximate such sections, we then introduce a Laplacian operator for quiver representations valued in Hilbert spaces. We provide explicit bounds on how the spectrum of a quiver Laplacian changes when the representation and the underlying quiver are modified in certain natural ways. Finally, we apply this machinery to the study of peak-calling algorithms which measure chromatin accessibility in single-cell data. We demonstrate that eigenvectors of the associated quiver Laplacian yield locally and globally compatible features.

Summary

  • The paper presents a framework that uses quiver representations to identify selected features that are compatible with a fixed decomposition of a dataset into subsets.
  • The paper introduces a Laplacian for quiver representations and gives explicit bounds on how its spectrum changes when the representation or the underlying quiver is modified.
  • The paper applies the method to peak calling on single-cell chromatin accessibility (scATAC-seq) data, where eigenvectors of the quiver Laplacian yield locally and globally compatible features.

Analyzing Feature Selection Algorithms through Quiver Representation and Laplacian Spectra with Applications to scATAC-seq Data

Introduction to Feature Selection in Decomposed Datasets

High-dimensional data analysis routinely requires identifying the most relevant features of a dataset. In natural language processing, for instance, words are embedded in a Euclidean space whose coordinates serve as features that capture semantic relationships. Feature selection is the process of extracting the features most informative for the task at hand, and it relies on some method for quantifying feature relevance. Whatever the selector, discrepancies often arise when it is applied to subsets of a dataset rather than to the whole: features that rank highly for the entire dataset may be irrelevant to a subset of interest, and vice versa. This matters particularly when the subsets overlap, as in biological datasets organized by cell type or disease type.
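The discrepancy between global and subset-level selection can be seen in a toy example. The sketch below is an illustration only and is not taken from the paper: the selector `top_k_by_variance` and the two data subsets are hypothetical, chosen so that one feature separates the subsets but is constant inside each of them.

```python
import numpy as np

def top_k_by_variance(X, k):
    """Hypothetical feature selector: indices of the k highest-variance columns."""
    variances = X.var(axis=0)
    # argsort is ascending, so the last k entries are the largest variances.
    return set(np.argsort(variances)[-k:].tolist())

# Rows are samples, columns are four features.
# Feature 0 is constant within each subset but differs between subsets.
# Feature 1 varies mildly everywhere; feature 2 varies only in subset A.
subset_a = np.array([[0.0,  1.0,  2.0, 0.0],
                     [0.0, -1.0, -2.0, 0.0],
                     [0.0,  1.0, -2.0, 0.0],
                     [0.0, -1.0,  2.0, 0.0]])
subset_b = np.array([[10.0,  1.0, 0.0, 0.0],
                     [10.0, -1.0, 0.0, 0.0],
                     [10.0,  1.0, 0.0, 0.0],
                     [10.0, -1.0, 0.0, 0.0]])
whole = np.vstack([subset_a, subset_b])

selected_whole = top_k_by_variance(whole, k=1)
selected_a = top_k_by_variance(subset_a, k=1)
selected_b = top_k_by_variance(subset_b, k=1)

print("whole dataset:", selected_whole)
print("subset A:     ", selected_a)
print("subset B:     ", selected_b)
```

Here the feature chosen for the whole dataset carries no information inside either subset, and the two subsets disagree with each other as well.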

Quiver Representations for Feature Selection across Dataset Decompositions

The paper introduces a framework for selecting features systematically across a fixed decomposition of the data into subsets, which are allowed to overlap. Feature selectors are abstracted as deterministic processes, and the selected features are encoded in a quiver representation valued in finite-dimensional Hilbert spaces. Sections of this representation identify the largest subspace of selected features that remains consistent with the decomposition. The construction covers both local and global forms of compatibility by considering restrictions and extensions across subsets, and it applies to any feature selector. Because real data are noisy and features are often highly correlated, exact sections are too restrictive, so the framework works with approximate sections instead.
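To make the notion of a section concrete: a quiver representation assigns a vector space to each vertex and a linear map A_e to each arrow e from s to t, and a section is a choice of one vector x_v per vertex such that A_e x_s = x_t along every arrow. The following is a minimal sketch of computing all sections as a null space, assuming real finite-dimensional spaces. The function names and the two-vertex example quiver are hypothetical, and the paper's actual construction of the representation from a feature selector is not reproduced here.

```python
import numpy as np

def constraint_matrix(dims, edges):
    """Stack one block row per arrow (s, t, A) encoding x -> A @ x_s - x_t."""
    offsets = np.concatenate(([0], np.cumsum(dims)))
    blocks = []
    for s, t, A in edges:
        block = np.zeros((dims[t], offsets[-1]))
        block[:, offsets[s]:offsets[s + 1]] += A
        block[:, offsets[t]:offsets[t + 1]] -= np.eye(dims[t])
        blocks.append(block)
    return np.vstack(blocks)

def sections(dims, edges, tol=1e-10):
    """Orthonormal basis (as columns) of the space of sections."""
    B = constraint_matrix(dims, edges)
    _, singular_values, vt = np.linalg.svd(B)
    rank = int(np.sum(singular_values > tol))
    # Rows of vt beyond the rank span the null space of B.
    return vt[rank:].T

# Two vertices, both carrying R^2, joined by two parallel arrows 0 -> 1.
dims = [2, 2]
edges = [(0, 1, np.eye(2)), (0, 1, np.diag([1.0, 2.0]))]

basis = sections(dims, edges)
print(basis.shape)        # one column per independent section
print(np.round(basis, 3))
```

In this example the two arrows agree only on the first coordinate, so the space of sections is one-dimensional and is spanned by a vector supported on the first coordinate at both vertices.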

The Quiver Laplacian and Approximate Sections

The quiver Laplacian is the central tool of the framework. Sections of the quiver representation, which correspond to globally compatible features, are recovered from the Laplacian, and approximate sections are defined through its eigenspaces. The paper provides explicit bounds on how the spectrum of a quiver Laplacian changes when the representation or the underlying quiver is modified in certain natural ways. These results connect the theory of feature selection to practice, in particular to the analysis of single-cell sequencing data, where the goal is to identify relevant genomic features.
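The idea of approximating sections through a Laplacian can be sketched numerically. The code below assumes the Laplacian has the form L = BᵀB, where B sends a choice of vectors x to the stacked residuals A_e x_s − x_t over all arrows; this makes the kernel of L equal to the space of sections, but the paper's exact definition and normalization may differ. The function names and the perturbed two-arrow example are hypothetical.

```python
import numpy as np

def constraint_matrix(dims, edges):
    """Stack one block row per arrow (s, t, A) encoding x -> A @ x_s - x_t."""
    offsets = np.concatenate(([0], np.cumsum(dims)))
    blocks = []
    for s, t, A in edges:
        block = np.zeros((dims[t], offsets[-1]))
        block[:, offsets[s]:offsets[s + 1]] += A
        block[:, offsets[t]:offsets[t + 1]] -= np.eye(dims[t])
        blocks.append(block)
    return np.vstack(blocks)

def quiver_laplacian(dims, edges):
    """Assumed form L = B^T B: symmetric, positive semidefinite."""
    B = constraint_matrix(dims, edges)
    return B.T @ B

# Two parallel arrows 0 -> 1 on R^2. The entry 1.05 (instead of 1.0) acts as
# noise: the arrows no longer agree exactly on any nonzero vector, so the
# only exact section is zero.
dims = [2, 2]
noisy_edges = [(0, 1, np.eye(2)), (0, 1, np.diag([1.05, 2.0]))]

L = quiver_laplacian(dims, noisy_edges)
eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigenvalues in ascending order
approx_section = eigenvectors[:, 0]            # eigenvector of the smallest one

print(np.round(eigenvalues, 4))
print(np.round(approx_section, 3))

# Deleting an arrow subtracts a positive semidefinite term from L,
# so no eigenvalue can increase.
L_removed = quiver_laplacian(dims, noisy_edges[:1])
eigenvalues_removed = np.linalg.eigh(L_removed)[0]
print(np.round(eigenvalues_removed, 4))
```

The smallest eigenvalue is close to zero and its eigenvector is close to the exact section of the noise-free example, which is the sense in which low eigenvectors act as approximate sections. The arrow-deletion comparison is one elementary instance of the spectrum responding to a change in the quiver; it is not a restatement of the paper's bounds.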

Applying the Framework to Single-Cell Chromatin Accessibility Data

The method is applied to peak calling on single-cell ATAC-seq data, a step that determines chromatin accessibility across the cell types in a sample. The paper constructs the associated quiver Laplacian and shows that its eigenvectors yield locally and globally compatible features (peaks). The application illustrates that the framework can operate at the dimensionality typical of genomic datasets and can identify genomic regions that are relevant across cell types.

Implications and Future Directions in AI and Data Analysis

A principled framework for feature selection in decomposed datasets supports more accurate and interpretable analysis wherever data naturally split into overlapping subsets. The paper supplies a theoretical foundation and demonstrates it on complex biological data. It invites further study of the behavior of quiver Laplacians and of their use in other domains where datasets consist of overlapping subsets. Its ability to accommodate noise and feature correlation makes the approach a candidate for feature selection problems in other disciplines.