Unsupervised Feature Selection Methods
- Unsupervised feature selection methods are techniques that identify a minimal subset of informative features from unlabeled data by preserving intrinsic data geometry.
- They employ diverse approaches—including filter, graph-based, sparsity-regularized, and autoencoder models—to reduce dimensionality and enhance clustering, classification, and visualization.
- Robust optimization strategies such as alternating minimization and augmented Lagrangian methods provide convergence guarantees and interpretable solutions despite challenges such as nonconvexity and scalability.
Unsupervised feature selection methods aim to extract a subset of informative features from unlabeled high-dimensional data while preserving essential structure for downstream analysis such as clustering, classification, or visualization. These methods address the limitations posed by redundant, irrelevant, or noisy features and operate without supervision, i.e., without access to ground-truth label information. They span a wide landscape of principled approaches, including filter, wrapper, embedded, and deep learning paradigms, with mathematical underpinnings in spectral analysis, graph theory, subspace learning, sparsity regularization, and information geometry.
1. Theoretical Foundations and Motivations
Unsupervised feature selection tackles the problem of selecting a minimal set of features that retains the intrinsic data information required for learning tasks in the absence of class labels (Parveen et al., 2013). Unlike supervised approaches that leverage label correlation, unsupervised techniques must exploit alternative criteria such as manifold geometry, variance, local neighborhood structure, or self-expressiveness (Sun et al., 2020, Parsa et al., 2019). The chief motivations are:
- Dimensionality reduction: Reducing computational and storage burden, and improving algorithmic robustness against the curse of dimensionality.
- Model interpretability and generalization: By discarding irrelevant or redundant features, models can generalize better and offer clearer insight into data sources.
- Noise and outlier resilience: Robust selection approaches mitigate sensitivity to data anomalies (Yu et al., 21 Dec 2025).
- Structure preservation: Methods aim to maintain relevant sample relationships and latent cluster structures (Li et al., 2021, Liang et al., 2024).
2. Core Methodologies, Models, and Criteria
Unsupervised feature selection methods can be categorized by their principles and mathematical structures:
A. Filter Methods
These compute feature scores based on intrinsic properties or relationships (such as variance, correlation, or graph Laplacian), independent of downstream learning:
- Principal Component Analysis (PCA) projects data into uncorrelated directions capturing maximal variance, reducing feature dimensionality but sacrificing original-variable interpretability (Parveen et al., 2013).
- Empirical Distribution Ranking (EDR) orders features by statistics derived from their empirical distribution functions (Parveen et al., 2013).
- Compactness Score (CSUFS) scores each feature directly by the local compactness of the samples along that feature's dimension (Zhu et al., 2022).
- Markov Multi-step Feature Selection (MMFS) uses multi-hop graph transition probabilities to capture both local and global data structures, offering both “negative” (structure-breaking) and “positive” (structure-preserving) selection rules (Min et al., 2020).
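As a concrete illustration of the filter paradigm, a simple variance criterion scores each feature independently of any downstream learner. The sketch below is a generic example of this style of scoring, not the rule of any specific method cited above:

```python
import numpy as np

def variance_filter(X, k):
    """Rank features by sample variance (a simple filter criterion)
    and return the indices of the k highest-variance features."""
    scores = X.var(axis=0)                  # one score per feature
    return np.argsort(scores)[::-1][:k]     # descending order

# toy data: feature 0 varies most, feature 1 is near-constant
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 3.0, 100),   # high-variance feature
    rng.normal(0, 0.1, 100),   # low-variance feature
    rng.normal(0, 1.0, 100),
])
print(variance_filter(X, 2))   # prints [0 2]
```

More sophisticated filters such as EDR or MMFS replace the variance score with distribution- or graph-derived statistics while keeping this same rank-and-select structure.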
B. Graph- and Manifold-based Embedded Methods
These yield feature importance via optimizing embedding or clustering objectives:
- Laplacian Score, MCFS, and NDFS select features maximizing the preservation of local manifold structures built from k-NN graphs (Parveen et al., 2013).
- Dual Manifold Re-ranking (DMRR) integrates sample-sample, feature-feature, and sample-feature manifold affinity matrices to jointly update sample and feature importances with biconvex optimization (Liang et al., 2024).
- Graph Filtering Self-Representation (GFASR) applies high-order graph filters (e.g. exp(-ηL)) both to smooth the data representation and to enforce self-representation regularizers, combined with ℓ_{2,1}-norm feature sparsity (Liang et al., 2024).
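The Laplacian Score mentioned above can be sketched compactly: features whose values vary smoothly over a k-NN graph of the samples receive low (better) scores. A minimal dense-matrix version, with the heat-kernel bandwidth t treated as an assumed hyperparameter:

```python
import numpy as np

def laplacian_score(X, n_neighbors=5, t=1.0):
    """Laplacian Score per feature (lower = better preservation of the
    k-NN manifold structure). Minimal dense-matrix sketch."""
    n, d = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:n_neighbors + 1]        # skip self
        S[i, nbrs] = np.exp(-sq[i, nbrs] / t)              # heat-kernel weights
    S = np.maximum(S, S.T)                                 # symmetrize
    deg = S.sum(axis=1)
    L = np.diag(deg) - S                                   # graph Laplacian
    scores = np.empty(d)
    for r in range(d):
        f = X[:, r] - (X[:, r] @ deg) / deg.sum()          # D-weighted centering
        denom = (f * deg) @ f
        scores[r] = (f @ L @ f) / denom if denom > 1e-12 else np.inf
    return scores
```

On data with two well-separated clusters, a feature aligned with the cluster structure scores lower (better) than a pure-noise feature; MCFS and NDFS build on the same graph but replace the per-feature score with embedded optimization.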
C. Sparsity-Regularized Subspace Learning
These methods select features through subspace projection and group sparsity constraints:
- Structured Sparsity with Adaptive Graph (JASFS) enforces ℓ_{2,0}-norm on the transformation matrix and learns an adaptive similarity graph for robust feature selection with automatic determination of the number of features (Sun et al., 2020).
- Nonnegative Orthogonal Constrained Minimization (NOCRM) jointly embeds group-sparse regression and nonnegative spectral clustering with inexact ALM and PAM optimization and guaranteed KKT convergence (Li et al., 2024).
- Class Margin Optimization (UFCM) incorporates maximum margin criterion (between-cluster scatter) and within-cluster K-means compactness with a nonconvex ℓ_{2,p} sparsity penalty (Wang et al., 2015).
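A common building block of these sparsity-regularized models is the ℓ_{2,1}-regularized regression subproblem, typically solved by iterative reweighting. In the sketch below the target matrix Y stands in for pseudo-labels or a spectral embedding; that choice, and the least-squares initialization, are assumptions of this illustration:

```python
import numpy as np

def l21_regression(X, Y, lam=0.1, n_iter=50, eps=1e-8):
    """Solve min_W ||XW - Y||_F^2 + lam * ||W||_{2,1} by iterative
    reweighting; the row norms of W serve as feature importances."""
    W = np.linalg.lstsq(X, Y, rcond=None)[0]           # least-squares init
    for _ in range(n_iter):
        # reweighting diagonal: D_ii = 1 / (2 * ||w_i||)
        D = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))
        W = np.linalg.solve(X.T @ X + lam * D, X.T @ Y)
    return W
```

Features are then ranked by `np.linalg.norm(W, axis=1)`; rows of W belonging to uninformative features shrink toward zero as the reweighting proceeds.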
D. Block-Alternating and Bi-Level Frameworks
- Self-Paced Learning and Redundant Regularization (SPLR) employs self-paced sample weighting, subspace learning, explicit manifold and feature redundancy regularizers, and a nonconvex ℓ_{2,1/2}-norm to enhance robustness (Li et al., 2021).
- Bi-Level Framework (BLUFS) combines spectral clustering pseudo-label embedding with exact ℓ_{2,0} feature selection in a unified PAM algorithm (Liu et al., 26 May 2025).
E. Kernel and Autoencoder-based Methods
- Kernel Alignment UFS (KAUFS/MKAUFS) optimizes matrix factorization jointly with kernel and redundancy alignment, enabling both single and multiple kernel learning to capture nonlinear feature interactions (Lin et al., 2024).
- Robust Autoencoder and Adaptive Graph Learning (RAEUFS) couples a nonlinear autoencoder, robust subspace recovery, graph-regularized clustering, and group sparsity via alternating block minimization (Yu et al., 21 Dec 2025).
- Autoencoder Feature Selection (AEFS) employs a regression autoencoder with a group-lasso penalty for unsupervised selection (Han et al., 2017).
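A minimal sketch of the AEFS idea, deliberately simplified to a linear encoder/decoder (the original method uses a nonlinear network), ranks features by the group-lasso-penalized encoder row norms:

```python
import numpy as np

def aefs_sketch(X, k=1, lam=0.05, lr=0.01, epochs=2000, seed=0):
    """AEFS-style sketch: linear autoencoder X ~ X @ W1 @ W2 with a
    group-lasso penalty on encoder rows; returns the row norms of W1
    as feature scores. The linear architecture is a simplification."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, k))                  # encoder
    W2 = rng.normal(0.0, 0.1, (k, d))                  # decoder
    for _ in range(epochs):
        H = X @ W1
        E = H @ W2 - X                                 # reconstruction error
        g1 = 2.0 * X.T @ E @ W2.T / n                  # grad of ||E||^2 wrt W1
        g2 = 2.0 * H.T @ E / n                         # grad of ||E||^2 wrt W2
        norms = np.linalg.norm(W1, axis=1, keepdims=True) + 1e-8
        W1 -= lr * (g1 + lam * W1 / norms)             # group-lasso subgradient
        W2 -= lr * g2
    return np.linalg.norm(W1, axis=1)
```

Encoder rows for features that cannot be reconstructed from the latent code are driven to zero by the penalty, while rows for features carrying shared structure survive.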
F. Group Structure Models
- GroupFS discovers and sparsely selects latent feature groups via Laplacian smoothness on both sample and feature graphs, stochastic group gates (STG), and fully differentiable loss minimization (Lifshitz et al., 12 Nov 2025).
G. Subspace Clustering with Self-Expressiveness
- SCFS integrates self-expressive subspace clustering (joint learning of adaptive similarity) and row-sparse feature regression in a nonconvex alternating framework (Parsa et al., 2019).
H. Hypergraph-Based Models
- Point-Weighting Hypergraph Feature Selection (HPWL) constructs soft hypergraphs using data centroids, applies point- and hyperedge-weighting schemes, and optimizes local/global structure and low-rank correlations via block-coordinate descent (Gilani et al., 2018).
3. Optimization Algorithms and Convergence
Optimization strategies have evolved to accommodate nonconvexity, combinatorial sparsity, and manifold constraints:
- Alternating minimization or block-coordinate frameworks are near-universal, updating projection matrices, sample/feature weights, and affinity graphs in cycles (Li et al., 2021, Sun et al., 2020, Liang et al., 2024, Li et al., 2024, Parsa et al., 2019).
- Augmented Lagrangian Method (ALM) and proximal alternating minimization (PAM) are rigorously used with convergence guarantees to KKT points, ensuring stable and monotonic decrease of nonsmooth objectives (Li et al., 2024, Liu et al., 26 May 2025).
- Accelerated Matrix Homotopy Iterative Hard-Thresholding (AMHIHT) and coordinate descent are adopted for structured sparsity and ℓ_{2,0} constraints (Sun et al., 2020, Sun et al., 2024).
- Multiplicative update rules, spectral clustering eigen decompositions, and ADMM are employed in kernel and autoencoder factorization models (Lin et al., 2024, Liang et al., 2024).
- When group structure is unknown, fully differentiable frameworks with stochastic gates and Gumbel-softmax sampling allow group sparsity in continuous relaxation (Lifshitz et al., 12 Nov 2025).
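For ℓ_{2,1} penalties, the proximal steps inside PAM-style schemes reduce to a closed-form row-wise group soft-thresholding; a minimal sketch:

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each row of W
    toward zero by tau in Euclidean norm; rows with norm below tau
    vanish entirely, producing row-sparse solutions."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

W = np.array([[3.0, 4.0],    # row norm 5 -> shrunk to norm 4
              [0.3, 0.4]])   # row norm 0.5 -> zeroed out
print(prox_l21(W, 1.0))      # rows become [2.4, 3.2] and [0, 0]
```

This operator is what makes row-sparse (feature-selecting) iterates cheap to compute inside each block update.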
4. Evaluation Protocols and Empirical Benchmarks
The standard evaluation protocol ranks and selects features, then applies clustering or classification on benchmark datasets, measuring clustering accuracy (ACC), normalized mutual information (NMI), purity, and sometimes redundancy rate (Yu et al., 21 Dec 2025, Sun et al., 2020, Liang et al., 2024, Parsa et al., 2019, Liu et al., 26 May 2025, Parveen et al., 2013):
| Paper (Method) | Datasets (examples) | Major Baselines | ACC/NMI Highlights |
|---|---|---|---|
| SPLR (Li et al., 2021) | USPS, Isolet, COIL20 | LS, MCFS, UDFS, RNE, SGFS | Best ACC on 7/9 datasets |
| JASFS (Sun et al., 2020) | Brain, MNIST, Jaffe | L-score, UDFS, RUFS, AUFS, UGFS | Wins on 5/8 datasets in ACC/NMI |
| DMRR (Liang et al., 2024) | WARPAR10P, LUNG | LapScore, MCFS, GRM, AGRM | ACC ↑ by ~12-14% over filters |
| SCFS (Parsa et al., 2019) | Lung, ORL, Isolet | LS, UDFS, NDFS, LDSSL | Best on every dataset in ACC |
| NOCRM (Li et al., 2024) | lung, Isolet, COIL20 | LS, MCFS, UDPFS | Outperforms all baselines |
| BLUFS (Liu et al., 26 May 2025) | Isolet, COIL20, lung | LapScore, MCFS, UDFS, SOGFS | ↑ Clustering and classification |
| RAEUFS (Yu et al., 21 Dec 2025) | lung, COIL20, USPS | URAFS, NNSE | Highest ACC/NMI; robust to outliers |
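The ACC metric reported in these benchmarks requires matching predicted cluster ids to ground-truth classes before counting agreement; a standard sketch via the Hungarian algorithm, assuming both labelings are encoded 0..k-1:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching of predicted clusters to true
    labels (Hungarian algorithm), then the fraction matched.
    Assumes labels and cluster ids are integers in 0..k-1."""
    k = len(np.unique(np.concatenate([y_true, y_pred])))
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                           # agreement counts
    row, col = linear_sum_assignment(-cost)       # maximize agreement
    return cost[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 1])      # permuted cluster ids
print(clustering_accuracy(y_true, y_pred))       # prints 0.875
```

NMI, by contrast, is permutation-invariant by construction and needs no matching step.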
5. Feature Subset Redundancy, Robustness, and Interpretability
Effective methods extend beyond selection to:
- Redundancy minimization: Explicit penalties for feature-feature similarity or inner-product regularization (e.g. KAUFS/MKAUFS (Lin et al., 2024), SPLR (Li et al., 2021), GroupFS (Lifshitz et al., 12 Nov 2025)).
- Outlier robustness: Use of ℓ₁ loss (RAEUFS (Yu et al., 21 Dec 2025)), self-paced sample weighting (SPLR (Li et al., 2021)), entropy-regularized graphs (JASFS (Sun et al., 2020)).
- Adaptive graph learning: Re-assigns neighborhood structure iteratively, providing both global and local structure alignment (GFASR (Liang et al., 2024), HPWL (Gilani et al., 2018), SCFS (Parsa et al., 2019)).
- Group interpretability: GroupFS (Lifshitz et al., 12 Nov 2025) yields spatially and semantically coherent feature groups for analysis.
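Redundancy rate, as reported in several of these works, is commonly computed as the mean absolute pairwise correlation among the selected features; the exact definition may differ per paper, so this is an illustrative sketch:

```python
import numpy as np

def redundancy_rate(X, selected):
    """Mean absolute Pearson correlation over all pairs of selected
    features; lower values indicate a less redundant subset."""
    C = np.abs(np.corrcoef(X[:, selected], rowvar=False))
    iu = np.triu_indices(len(selected), 1)        # upper-triangular pairs
    return C[iu].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([z,
                     z + 0.01 * rng.normal(size=200),   # near-duplicate of z
                     rng.normal(size=200)])             # independent feature
print(redundancy_rate(X, [0, 1]) > redundancy_rate(X, [0, 2]))  # prints True
```

Penalizing this quantity (or an inner-product surrogate of it) is exactly what the redundancy-minimization terms above add to the selection objective.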
6. Algorithmic Complexity and Scalability
Complexity varies with method and target scale:
- Most alternating minimization methods scale as O(d³), O(nd²), or O(n²d) per iteration, depending on whether matrix inverses, eigen-decompositions, or graph-building operations dominate (Sun et al., 2020, Li et al., 2024, Liang et al., 2024).
- Methods utilizing low-rank factorization or centroids (HPWL (Gilani et al., 2018)) minimize computational overhead, reaching convergence in ≈2 outer iterations.
- For deep models (RAEUFS (Yu et al., 21 Dec 2025), AEFS (Han et al., 2017)), per-epoch complexity depends on autoencoder depth and chosen optimization algorithms.
- Graph construction steps (GroupFS (Lifshitz et al., 12 Nov 2025), DMRR (Liang et al., 2024)) can become prohibitive when n≫10⁴; random-projection or anchor-graph approximations are proposed for scalability.
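An anchor-graph approximation of the kind proposed for scalability can be sketched as follows: m k-means anchors replace the full n×n affinity with an n×m matrix Z, cutting memory to O(nm). The bandwidth heuristic and the use of scipy's `kmeans2` are assumptions of this sketch:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def anchor_graph(X, m=50, s=3, seed=0):
    """Anchor-graph affinity: link each sample to its s nearest of m
    k-means anchors with normalized Gaussian weights, producing an
    n x m matrix Z in place of an n x n graph."""
    anchors, _ = kmeans2(X, m, minit='++', seed=seed)
    sq = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.zeros((X.shape[0], m))
    for i in range(X.shape[0]):
        nbr = np.argsort(sq[i])[:s]                       # s nearest anchors
        w = np.exp(-sq[i, nbr] / (sq[i, nbr].mean() + 1e-12))
        Z[i, nbr] = w / w.sum()                           # rows sum to 1
    return Z
```

An approximate graph Laplacian can then be derived from Z (e.g. via a low-rank factorization of Z and its column sums) without ever materializing the full n×n affinity matrix.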
7. Limitations and Future Directions
Recognized constraints include:
- Parameter sensitivity: Several hyperparameters require grid-search tuning for optimal performance, e.g. regularization weights, graph construction parameters.
- Scalability: Large n or d may require approximate or randomized methods, particularly for graph-based constructions.
- Local minima and nonconvexity: Nonconvex regularizations (ℓ₂,₀, ℓ_{2,1/2}) and deep learning models admit only local convergence guarantees.
- Group discovery and dynamic selection: Adaptive group selection remains a challenge; methods (GroupFS (Lifshitz et al., 12 Nov 2025)) suggest directions for dynamic or context-sensitive grouping.
- Integration with deep, multi-view, and semi-supervised models: Kernel alignment, multi-task meta-learning, and hybrid models are emerging for improved structure capture and sample efficiency (Kumagai et al., 2021, Han et al., 2017).
Future research will likely focus on scalable, adaptive graph construction, deep manifold modeling, and unsupervised meta-learning paradigms (Kumagai et al., 2021). Integration of interpretable grouping and dynamic selection mechanisms is anticipated to benefit applications in vision, genomics, and social science data analysis, where reliable structure must be inferred absent labels.