Clustering-Based Data Selection

Updated 7 October 2025
  • The framework leverages clustering to identify representative subsets, reducing redundancy and boosting model performance.
  • It integrates variable, sample, and gradient-space clustering approaches to optimize feature selection and enhance interpretability.
  • Empirical studies in AutoML, recommender systems, and fine-tuning models validate its efficiency and robustness in high-dimensional settings.

A clustering-based data selection framework is a methodological paradigm in which clustering algorithms are leveraged not only for unsupervised partitioning of data but also as integral mechanisms for variable selection, feature subspace identification, data subset selection, or algorithm selection tailored to downstream analytical tasks. Across diverse research lines, such frameworks exploit the structure imposed by clustering—whether on samples, variables, or feature representations—to optimize model performance, interpretability, and computational efficiency in high-dimensional or heterogeneous settings. The development of clustering-based data selection frameworks spans fields including statistical learning, unsupervised feature selection, active learning, recommender systems, and automated machine learning.

1. Principles and Motivation

Clustering-based data selection frameworks operate on the guiding principle that intrinsic structure—whether among data points (samples), variables (features), or even tasks—can inform more targeted or efficient modeling strategies by reducing redundancy or focusing on representative subsets. The motivations are multifaceted:

  • Dimensionality reduction: High-dimensional data sets often contain subsets of features or samples that are redundant or uninformative. Clustering can uncover latent dependencies or similarity patterns, facilitating selection of informative groups or representatives (Chavent et al., 2016, Andrews et al., 2013).
  • Computational efficiency: By selecting exemplars or groups (e.g., via clustering in embedding space or gradient space), one can dramatically reduce the computational budget required for training, inference, or hyperparameter optimization (Axiotis et al., 27 Feb 2024, Wang et al., 12 Jun 2025).
  • Model robustness and interpretability: Variable clustering or clustering of correlated features enhances model interpretability, especially when cluster representatives are selected for further analysis (Spooner et al., 2022, Chavent et al., 2016).
  • Task adaptation and personalization: In recommender systems or AutoML, clustering users or datasets enables selection of optimal algorithms or models for each partition, yielding measurable accuracy and efficiency improvements (Lizenberger et al., 28 May 2024, Singh et al., 15 Jul 2024).

2. Clustering Modalities in Data Selection

Clustering-based selection frameworks can be classified by the axis or substrate of clustering:

  • Variable/Feature Clustering: Groups features based on similarity (e.g., correlation, redundancy, or information-theoretic measures). Synthetic variables or cluster representatives are then selected for downstream modeling, reducing variable redundancy and enhancing interpretability (Chavent et al., 2016, Spooner et al., 2022).
  • Sample Clustering: Partitions the data samples. In data-efficient learning, clustering in embedding space provides a mechanism to select “typical” or “representative” samples, with sensitivity sampling improving approximation guarantees for overall loss or label distribution (Axiotis et al., 27 Feb 2024). A minimal sketch of this representative-selection idea appears after this list.
  • Gradient-Space or Influence-Based Clustering: Pools training samples with similar gradient signatures, thus compressing influence estimation for fine-tuning large-scale models and enabling resource-aware selection (Wang et al., 12 Jun 2025).
  • Meta-Dataset Clustering: Clusters datasets themselves (e.g., using optimal transport distances) for zero-shot transfer of pipeline recommendations in AutoML for clustering (Singh et al., 15 Jul 2024).
  • Temporal or Topological Clustering: Applies clustering using non-standard domains, such as topological summaries of time series for robust financial index tracking and asset selection (Goel et al., 30 Jan 2024).
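
As a minimal illustration of sample clustering for representative selection, the sketch below (Python, with illustrative names and parameters) clusters samples with k-means and keeps the point closest to each centroid; practical frameworks replace this deterministic rule with the weighted or influence-based sampling schemes discussed in the following sections.

```python
# Minimal sketch: one representative per cluster (the sample nearest its
# k-means centroid). Illustrative only; specific frameworks use sensitivity-
# or influence-based sampling instead of this deterministic rule.
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(X: np.ndarray, n_clusters: int = 10, seed: int = 0) -> np.ndarray:
    """Return the index of one representative sample per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=seed).fit(X)
    reps = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        # distance of each cluster member to its centroid
        d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(d)])
    return np.array(reps)

# Usage: rep_idx = select_representatives(embeddings, n_clusters=100)
```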

3. Methodological Structures and Algorithms

A variety of clustering algorithms and optimization paradigms underpin these frameworks; the methodological structure depends on the application domain and granularity of selection.

| Substrate | Typical Clustering Methods | Post-Processing / Selection Strategies |
|---|---|---|
| Variables | Hierarchical clustering, PCAmix | Synthetic variable creation, random forest VI, univariate filtering |
| Samples | k-means, k-median, affinity propagation | Sensitivity or influence-based sampling, coreset construction |
| Gradient features | k-means in cosine space | Modified UCB bandit allocation, budget-constrained exploration |
| Meta-datasets | Optimal transport (GW), meta-learning | Zero-shot transfer of optimal pipelines |
| Time series (TDA) | Affinity propagation on TDA kernels | Exemplar selection for portfolio optimization |

In variable clustering frameworks (Chavent et al., 2016), variables are grouped via hierarchical clustering (using mixed-type similarity measures for numerical/categorical data), and “synthetic variables” are derived as principal components (e.g., by PCAmix) to reduce dimensionality and redundancy. A wrapper feature selection procedure (e.g., VSURF with random forests) then operates on these synthetic variables.
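
The following Python sketch is a hypothetical analogue of this pipeline (the cited work relies on R tooling such as PCAmix/ClustOfVar and VSURF, which have no direct Python counterpart): numerical features are grouped by correlation-based hierarchical clustering, and each group is summarized by its first principal component as a synthetic variable, which a wrapper selector can then score.

```python
# Hypothetical Python analogue of variable clustering with synthetic variables.
# Features are grouped by correlation; each group is summarized by its first
# principal component. Names and thresholds are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.decomposition import PCA

def synthetic_variables(X: np.ndarray, n_groups: int = 5) -> np.ndarray:
    """Cluster numerical features by correlation and return one synthetic variable per group."""
    corr = np.corrcoef(X, rowvar=False)
    dissim = 1.0 - np.abs(corr)                          # feature dissimilarity
    Z = linkage(squareform(dissim, checks=False), method="average")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    synth = []
    for g in np.unique(labels):
        cols = np.flatnonzero(labels == g)
        pc1 = PCA(n_components=1).fit_transform(X[:, cols])  # first PC of the group
        synth.append(pc1.ravel())
    # Columns of the result play the role of "synthetic variables" fed to a
    # wrapper selector (e.g., random forest variable importance).
    return np.column_stack(synth)
```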

Sample clustering in high-dimensional embedding or latent spaces is central to data selection for model fine-tuning. For instance, (Axiotis et al., 27 Feb 2024) clusters embeddings using k-means, then uses a sensitivity-aware sampling scheme: sampling probabilities are proportional to the sum of a proxy loss and a distance-to-center term, with theoretical approximation guarantees on average loss. In (Wang et al., 12 Jun 2025), clustering is performed over gradient feature space, and a modified Upper Confidence Bound (UCB) bandit algorithm allocates sampling budget across clusters.
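
A minimal sketch of the sensitivity-sampling step described above, assuming precomputed embeddings and per-sample proxy losses are available; the constants and normalizations are illustrative rather than the exact values used in the cited work.

```python
# Sketch of clustering-based sensitivity sampling: sampling probabilities are
# proportional to a proxy loss plus the distance to the assigned k-means
# center. Constants/normalizations are illustrative, not the paper's exact ones.
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(embeddings, proxy_loss, k=50, budget=1000, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init="auto", random_state=seed).fit(embeddings)
    dist_to_center = np.linalg.norm(
        embeddings - km.cluster_centers_[km.labels_], axis=1
    )
    scores = proxy_loss + lam * dist_to_center     # per-sample sensitivity score
    p = scores / scores.sum()                      # sampling distribution
    idx = rng.choice(len(embeddings), size=budget, replace=False, p=p)
    weights = 1.0 / (p[idx] * budget)              # illustrative importance weights
    return idx, weights
```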

Meta-dataset clustering with Gromov-Wasserstein (GW) distances enables zero-shot model selection in AutoML for clustering problems (Singh et al., 15 Jul 2024), where the optimal pipeline for the most similar dataset is recommended, and similarity is computed between embedded representations (φ) of datasets via scalable GW-LR approximations.
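
The sketch below illustrates the zero-shot transfer idea, assuming the POT library (`ot`) is installed and a table mapping each meta-dataset to its best-known pipeline already exists; CLAMS uses a scalable low-rank GW approximation (GW-LR), whereas this example calls the exact solver for brevity.

```python
# Sketch of zero-shot pipeline recommendation via Gromov-Wasserstein dataset
# similarity. Assumes the POT library ("pip install pot"); the cited system
# uses a low-rank GW approximation rather than the exact solver shown here.
import numpy as np
import ot  # Python Optimal Transport

def gw_distance(X_a: np.ndarray, X_b: np.ndarray) -> float:
    """GW discrepancy between two datasets via their intra-dataset distance matrices."""
    C_a = ot.dist(X_a, X_a)                       # pairwise distances within dataset A
    C_b = ot.dist(X_b, X_b)
    p, q = ot.unif(len(X_a)), ot.unif(len(X_b))   # uniform sample weights
    return ot.gromov.gromov_wasserstein2(C_a, C_b, p, q, loss_fun="square_loss")

def recommend_pipeline(new_dataset, meta_datasets, best_pipelines):
    """Return the pipeline of the most GW-similar meta-dataset (zero-shot transfer)."""
    dists = [gw_distance(new_dataset, D) for D in meta_datasets]
    return best_pipelines[int(np.argmin(dists))]
```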

4. Practical Applications and Experimental Evidence

Clustering-based data selection frameworks have been empirically validated in a wide range of settings:

  • Fine-tuning foundation models: Sensitivity sampling after k-means clustering in embedding space enables efficient selection of diverse, loss-representative subsets for translation and classification tasks. Experimental results show comparable or superior accuracy with orders-of-magnitude fewer samples compared to uniform or diversity-based sampling, and near-optimality relative to leverage score sampling in regression (Axiotis et al., 27 Feb 2024).
  • Supervised, semi-supervised, and unsupervised feature selection: Frameworks such as clustering plus random forest variable importance (Chavent et al., 2016) and ensemble selection post-clustering (Spooner et al., 2022) have demonstrated stability improvements and maintained or improved predictive performance in clinical and proteomic datasets.
  • Data-efficient LLM fine-tuning: By clustering training samples in gradient space and allocating gradient computations via UCB, ClusterUCB achieves competitive accuracy to full-gradient selection methods while operating at a fraction of the computation cost, as evidenced on MMLU, GSM8k, TydiQA, and HumanEval (Wang et al., 12 Jun 2025).
  • Recommender systems: User clustering (via k-means or community detection) followed by per-cluster algorithm selection yields nDCG@10 improvements as high as 360% versus baseline global algorithm selection, as observed across diverse recommendation datasets (Lizenberger et al., 28 May 2024).
  • Automated pipeline selection in clustering (AutoML): CLAMS, via GW-LR-based dataset similarity, selects pipelines that statistically outperform baseline automated and manual clusterers on 57 OpenML datasets in terms of adjusted mutual information (Singh et al., 15 Jul 2024).
  • Sparse portfolio selection: Clustering of assets using topological data analysis-based distances leads to portfolios that maintain tight index tracking with low turnover and reduced asset count, robust even during COVID-19 market stress (Goel et al., 30 Jan 2024).

5. Theoretical Guarantees and Optimization

Several frameworks offer theoretical error or complexity guarantees:

  • Coreset-based frameworks: For Hölder-continuous losses and clustering with tight intra-cluster variance, the sensitivity sampling protocol yields a selected subset whose weighted average loss estimates the true dataset loss within (1 ± ε) multiplicative and ελΦₖ additive error, with Φₖ being the k-means cost (Axiotis et al., 27 Feb 2024); a schematic statement appears after this list.
  • Fixed-parameter tractability: Feature selection for clustering in categorical spaces (w.r.t. the Hamming distance) is shown to be fixed-parameter tractable with respect to the cost threshold B when the alphabet size and cluster number are constants (Bandyapadhyay et al., 2021).
  • Automated model selection: Meta-learning frameworks employing OT-based distances are statistically validated using AMI and Bayesian signed-rank tests with region of practical equivalence, achieving the top performance rank among evaluated models (Singh et al., 15 Jul 2024).
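
In schematic notation (symbols chosen here for illustration; exact constants and normalizations follow the cited paper), the coreset guarantee above can be written as follows, where sampling probabilities combine a proxy loss with the distance to the assigned center:

```latex
% Schematic statement of the sensitivity-sampling guarantee; notation is
% illustrative, exact constants follow Axiotis et al. (27 Feb 2024).
\[
  p_i \;\propto\; \tilde{\ell}(x_i) + \lambda\, d\big(x_i, c(x_i)\big),
  \qquad
  \Big| \sum_{i \in S} w_i\, \ell(x_i) - \sum_{i=1}^{n} \ell(x_i) \Big|
  \;\le\; \varepsilon \sum_{i=1}^{n} \ell(x_i) + \varepsilon \lambda\, \Phi_k ,
\]
% S: sampled subset with importance weights w_i = 1/(|S| p_i);
% \ell: (Hölder-continuous) loss; \Phi_k: k-means cost of the clustering.
```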

The selection process is often cast as a constrained optimization (e.g., budget allocation in a bandit setting (Wang et al., 12 Jun 2025), MDL minimization for subspace-cluster assignment (Leiber et al., 2023)), making explicit the trade-offs between exploration, exploitation, error, and computational resources.
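
As an illustration of the bandit-style budget allocation, the following sketch distributes a fixed number of gradient evaluations across clusters with a UCB rule; the reward definition and exploration constant are placeholders rather than the exact ClusterUCB specification.

```python
# Schematic UCB allocation of a gradient-computation budget across clusters.
# reward_fn(cluster_id) is assumed to draw one sample from that cluster,
# compute its influence/utility (e.g., gradient similarity to a target set),
# and return a scalar; it is a placeholder, not part of any library.
import numpy as np

def ucb_allocate(cluster_ids, reward_fn, budget, c=1.0):
    k = len(cluster_ids)
    counts = np.zeros(k)
    means = np.zeros(k)
    selected = []
    for t in range(1, budget + 1):
        # UCB score: exploit clusters with high observed utility, explore rarely probed ones
        ucb = means + c * np.sqrt(np.log(t + 1) / np.maximum(counts, 1))
        ucb[counts == 0] = np.inf                # force one pull per cluster first
        j = int(np.argmax(ucb))
        r = reward_fn(cluster_ids[j])            # one gradient evaluation spent
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]   # running mean of cluster utility
        selected.append((cluster_ids[j], r))
    return selected
```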

6. Interpretability, Robustness, and Extensions

  • Interpretability: Clustering-based groupings (of features, samples, assets, or users) facilitate domain interpretation by associating importance or modeling decisions with coherent and often domain-relevant structures, as in clinical biomarkers (Spooner et al., 2022) and user segment-specific recommender configurations (Lizenberger et al., 28 May 2024).
  • Robustness: Temporal and topological clustering methods capture invariances or dynamics missed by simpler similarity measures, enhancing robustness under distributional shifts or stress (e.g., TDA-based clustering in index tracking (Goel et al., 30 Jan 2024)). Outlier-handling mechanisms in subspace clustering frameworks provide reliable cluster assignments even in the presence of significant contamination (Leiber et al., 2023).
  • Scalability and efficiency: Most frameworks emphasize linear or sublinear computational scaling with dataset size, via hierarchical clustering, approximate neighborhood methods, efficient gradient computations, or low-rank matrix approximations.

7. Challenges and Future Research

Open challenges include:

  • Choice of clustering parameters and interpretability of clusters: Most practical frameworks use data-driven or automatic methods (e.g., MDL principle (Leiber et al., 2023)), but robust selection for arbitrarily structured data remains complex.
  • Integration with supervised and semi-supervised tasks: While many frameworks are unsupervised, hybrid methodologies that use weak supervision (labels, proxies, or expert constraints) remain an area of active development.
  • Handling missing-not-at-random data and complex data dependencies: Recent frameworks are being extended to simultaneously address variable selection and non-ignorable missingness in high-dimensional, highly structured domains (Ho et al., 25 May 2025).
  • Dynamic and online settings: Future research is likely to explore online adaptation of cluster-based selections as data distributions and model states evolve.
  • Meta-learning and transferability: Automatic meta-model selection across datasets using optimal transport or related metrics is a promising direction for generalizing clustering-based data selection to broader AutoML and transfer learning settings (Singh et al., 15 Jul 2024).

In summary, the clustering-based data selection framework comprises a diverse spectrum of methodologies that leverage the unsupervised structure in data to drive variable, sample, or algorithm selection, improving the efficiency, accuracy, and interpretability of machine learning and statistical analysis pipelines. These frameworks are supported by both strong empirical evidence across domains and, in several cases, rigorous theoretical guarantees.
