Clustering-Based Approaches Overview
- Clustering-Based Approaches are algorithmic frameworks that partition data into homogeneous subgroups based on similarity measures and latent structures.
- They integrate paradigms such as prototype-, connectivity-, density-, probabilistic- and deep learning-based models to enable versatile applications like pattern analysis and knowledge extraction.
- These methods also serve as critical preprocessing steps for supervised or symbolic tasks, thereby enhancing model interpretability and subsequent analysis.
Clustering-based approaches encompass a heterogeneous family of algorithmic and statistical frameworks that partition data into homogeneous subgroups (clusters) based on some notion of similarity or latent structure. These methods underpin a vast array of tasks in pattern analysis, exploratory data mining, knowledge extraction, and scalable learning, with rigorous developments in prototype-based, connectivity-based, density-based, probabilistic model-based, and modern deep learning paradigms. Clusterings may serve not only as end goals but as preprocessing steps toward further supervised or symbolic tasks. This article surveys foundational models, general algorithmic strategies, key theoretical constructs, and advanced extensions across representative domains.
1. Foundational Models and Taxonomy
Clustering methods are classically organized along three main paradigms: 1) Prototype-based, where clusters are indexed by representatives (e.g., k-means centroids, mixture model components); 2) Connectivity-based, leveraging pairwise similarities and graph-theoretic community structure; 3) Density-based, relying on regions of data space with high sample concentration.
Formally, let $\mathcal{X} = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, and let $\mathcal{C} = \{C_1, \ldots, C_K\}$ denote a partition of $\mathcal{X}$ into $K$ clusters. Assignments may be hard or soft, as in finite mixture models, where each point is associated with a membership vector $(\tau_{i1}, \ldots, \tau_{iK})$ with $\sum_{k} \tau_{ik} = 1$.
Clustering frameworks can be further distinguished by:
- Objective Functionality (e.g., total within-cluster variance, penalized likelihood, entropy);
- Assignment Form (crisp vs. fuzzy);
- Representation (vectorial, sequence/curve, set/multiset, graph/network, etc.);
- Supervision level (unsupervised, semi-supervised via constraints, or selection-based).
2. Prototype- and Model-Based Approaches
2.1. k-means and Gradient-Based Schemes
The prototypical center-based approach minimizes the squared spread

$$J(c, \mu) = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2,$$

where $c(i)$ is the cluster assignment for $x_i$ and $\mu_k$ is the centroid of cluster $k$. This is solved by Lloyd's iterative algorithm, alternating assignment and centroid updates. Extensions to broader classes of cost functions have been formalized via gradient-based clustering (Armacki et al., 2022), allowing the replacement of the squared Euclidean cost with Bregman divergences or the Huber loss:

$$J_f(c, \mu) = \sum_{i=1}^{n} f\big(x_i, \mu_{c(i)}\big).$$

This generalization admits robust and flexible alternatives, encompassing $k$-medians and other non-Euclidean metrics, while guaranteeing convergence under mild smoothness and monotonicity conditions.
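Below is a minimal NumPy sketch of this alternating scheme, with a Huber per-point cost as an illustrative robust alternative to the squared Euclidean cost; it is a didactic sketch, not the reference implementation of Armacki et al. (2022), and the centroid update shown is exact only for the squared cost.

```python
import numpy as np

def huber(r, delta=1.0):
    """Elementwise Huber loss of residual norms r (illustrative robust cost)."""
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def lloyd_like(X, K, cost="squared", n_iter=100, seed=0):
    """Alternate point-to-center assignment and center updates (Lloyd-style)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        costs = 0.5 * dists**2 if cost == "squared" else huber(dists)
        labels = costs.argmin(axis=1)                      # assignment step
        for k in range(K):                                 # update step
            if np.any(labels == k):
                # Mean is the exact minimizer for the squared cost;
                # a gradient step on the chosen cost would be used more generally.
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```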
2.2. Model-Based Clustering: Mixture Models and Penalized EM
The model-based viewpoint assumes data are generated from a finite mixture:

$$p(x \mid \Psi) = \sum_{k=1}^{K} \pi_k \, f_k(x \mid \theta_k), \qquad \pi_k \ge 0, \ \sum_{k=1}^{K} \pi_k = 1,$$

and parameters are typically inferred via the Expectation-Maximization (EM) algorithm. Initialization sensitivity and unknown $K$ are classic problems. For curve (functional) data, robust penalty-based EM variants (Chamroukhi, 2013) incorporate entropy regularization:

$$\mathcal{J}_\lambda(\Psi) = \mathcal{L}(\Psi) - \lambda\, H(\pi), \qquad H(\pi) = -\sum_{k=1}^{K} \pi_k \log \pi_k,$$

with $\mathcal{L}(\Psi)$ the observed-data log-likelihood, $H(\pi)$ the mixing entropy, and $\lambda \ge 0$ a regularization parameter. This induces automatic pruning of under-supported clusters by making the mixing weights of small clusters contract toward zero iteratively. The penalized EM variants reliably drive the model toward compact and well-supported partitions, adapting $K$ upward or downward during optimization.
Table: Key Steps in Robust Penalized EM for Regression Mixtures (Curves)
| Step | Operation/Formula | Role |
|---|---|---|
| E-step | $\tau_{ik} = \dfrac{\pi_k\, p(y_i \mid \theta_k)}{\sum_{k'} \pi_{k'}\, p(y_i \mid \theta_{k'})}$ | Soft responsibility of cluster $k$ for curve $i$ |
| M-step | $\pi_k$ updated from average responsibilities with entropic shrinkage of weakly supported components | Prevents proliferation of small clusters |
| M-step | Responsibility-weighted least-squares update of regression coefficients $\beta_k$ | Weighted regression for curves |
| M-step | Update $\sigma_k^2$ by the responsibility-weighted residual mean square | Variance within cluster |
This joint approach, with entropic pruning, eliminates the need for ad hoc initialization schemes or standalone model-selection criteria.
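As a concrete illustration of the entropic-pruning idea, the following is a minimal sketch of one penalized EM iteration for a one-dimensional Gaussian mixture rather than the full regression-mixture setting of Chamroukhi (2013); the exact shrinkage form and the pruning threshold `prune_tol` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def penalized_em_step(x, pi, mu, sigma, lam=0.05, prune_tol=1e-3):
    """One EM iteration with an entropic penalty that shrinks weak components."""
    # E-step: responsibilities tau[i, k]
    dens = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)], axis=1)
    tau = dens / dens.sum(axis=1, keepdims=True)
    # M-step: mixing weights with entropic shrinkage, then renormalize
    log_pi = np.log(np.maximum(pi, 1e-12))
    pi_new = tau.mean(axis=0) + lam * pi * (log_pi - np.sum(pi * log_pi))
    pi_new = np.clip(pi_new, 0.0, None)
    pi_new /= pi_new.sum()
    # Prune under-supported components (automatic reduction of K)
    keep = pi_new > prune_tol
    pi_new, tau = pi_new[keep] / pi_new[keep].sum(), tau[:, keep]
    # M-step: component means and variances from responsibility-weighted moments
    w = tau / tau.sum(axis=0)
    mu_new = (w * x[:, None]).sum(axis=0)
    sigma_new = np.sqrt((w * (x[:, None] - mu_new) ** 2).sum(axis=0) + 1e-12)
    return pi_new, mu_new, sigma_new
```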
3. Connectivity- and Similarity-Based Approaches
3.1. Graph and Community-based Clustering
Viewing objects as vertices in a weighted graph, with edge weights derived from similarity functions built on, e.g., Chebyshev or Manhattan distances, clustering may be recast as a community identification problem (Rodrigues et al., 2011). Leading methods include:
- Fast-greedy modularity optimization (Clauset-Newman-Moore),
- Extremal Optimization,
- Walktrap based on random walks.
Automatic selection of the number of communities is achieved by maximizing the modularity over successive merges:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),$$

where $A$ is the weighted adjacency matrix, $k_i$ the (weighted) degree of vertex $i$, $m$ the total edge weight, and $\delta(c_i, c_j) = 1$ when $i$ and $j$ share a community. This makes these approaches robust to cluster shape heterogeneity, nonconvexity, and overlap.
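A minimal sketch of this pipeline using NetworkX's greedy modularity maximization (Clauset-Newman-Moore) is shown below; the Gaussian kernel on Manhattan distance is an illustrative choice of edge weight, not the specific similarity used in the cited work.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def similarity_graph_communities(X, sigma=1.0):
    """Build a weighted similarity graph and cluster it by greedy modularity maximization."""
    n = len(X)
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            d = np.abs(X[i] - X[j]).sum()               # Manhattan distance
            G.add_edge(i, j, weight=np.exp(-d / sigma)) # similarity as edge weight
    comms = greedy_modularity_communities(G, weight="weight")
    labels = np.empty(n, dtype=int)
    for c, members in enumerate(comms):
        labels[list(members)] = c
    return labels
```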
3.2. Spectral and Agglomerative Methods
Spectral clustering proceeds via affinity graph construction and Laplacian embedding. Agglomerative schemes, including average-link and single-link, yield multiresolution dendrograms but are sensitive to linkage selection and scale as $O(n^2)$ for the distance matrix or $O(n^3)$ for naive agglomeration.
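A minimal NumPy sketch of the Laplacian-embedding step follows: build a Gaussian affinity matrix, form the symmetric normalized Laplacian, and embed into the leading eigenvectors; the rows of the embedding would then be clustered with any center-based method. The kernel bandwidth and row normalization are illustrative choices.

```python
import numpy as np

def spectral_embed(X, K, sigma=1.0):
    """Affinity matrix -> symmetric normalized Laplacian -> K-dimensional spectral embedding."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-12))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Eigenvectors of the K smallest eigenvalues span the cluster-indicator subspace
    vals, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :K]
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12   # row-normalize the embedding
    return U   # cluster rows of U with, e.g., k-means
```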
4. Density-, Set- and Structure-Aware Approaches
4.1. Density- and Shape-Preserving Distributed Clustering
Density-based clustering, notably DBSCAN and its distributed extensions (Le-Khac et al., 2017, Bendechache et al., 2017), assigns clusters as density-connected regions. Communication-efficient distributed schemes may extract cluster boundaries (via balance vectors and cone-based predicates) and merge contours globally, achieving sensitive recovery of complex shapes with minimal bandwidth.
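The following sketch illustrates only the single-machine DBSCAN step on nonconvex shapes using scikit-learn; the distributed contour-extraction and merging machinery of the cited works is not reproduced, and the `eps`/`min_samples` values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two crescent-shaped clusters plus noise; eps and min_samples are illustrative values.
rng = np.random.default_rng(0)
theta = rng.uniform(0, np.pi, 300)
X = np.vstack([
    np.c_[np.cos(theta), np.sin(theta)],
    np.c_[1 - np.cos(theta), 0.5 - np.sin(theta)],
]) + rng.normal(scale=0.05, size=(600, 2))

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # label -1 marks noise points
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```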
4.2. Point Pattern (Set) and Random Finite Set Clustering
For data comprising multi-sets or point patterns, nonparametric dissimilarity-based clustering (using Hausdorff, Wasserstein, or OSPA metrics) is robust to variable cardinality and supports streaming contexts (Tran et al., 2017). Model-based alternatives treat each cluster as a random finite set (RFS) mixture fitted by EM, parameterizing both cardinality and feature distributions. These approaches yield interpretable bag-level models and facilitate accurate assignment in domains such as spatial point processes and multi-instance learning.
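A minimal sketch of the dissimilarity-based route is given below: pairwise Hausdorff distances between point patterns followed by average-linkage agglomeration; Wasserstein or OSPA metrics would substitute into the same pipeline. The linkage choice and SciPy-based implementation are illustrative, not the procedure of Tran et al. (2017).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff, squareform
from scipy.cluster.hierarchy import linkage, fcluster

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point patterns (n_i x d arrays)."""
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

def cluster_point_patterns(patterns, n_clusters):
    """Pairwise Hausdorff dissimilarities -> average-linkage agglomerative clustering."""
    n = len(patterns)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = hausdorff(patterns[i], patterns[j])
    Z = linkage(squareform(D), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```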
5. Semi-Supervised, Consensus, and Selection-Based Extensions
5.1. Constraint-Based Cluster Selection
Incorporating weak supervision, constraint-based selection frameworks (COBS) utilize pairwise must-link/cannot-link information not by altering underlying objectives, but as an external oracle that selects from a library of unsupervised candidate clusterings (Craenendonck et al., 2016). This surprisingly simple approach systematically improves the adjusted Rand index (ARI) on UCI benchmarks, outperforming bespoke metric-learning or assignment-adapting algorithms.
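The sketch below captures the selection idea in miniature: generate candidate clusterings (here, k-means over several values of $K$) and pick the candidate satisfying the most constraints. It mirrors the spirit of COBS rather than reproducing its candidate-generation or querying procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_constraints(X, must_link, cannot_link, k_values=range(2, 10), seed=0):
    """Pick the candidate clustering that satisfies the most pairwise constraints."""
    best_labels, best_score = None, -1
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = sum(labels[i] == labels[j] for i, j in must_link) + \
                sum(labels[i] != labels[j] for i, j in cannot_link)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```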
5.2. Consensus Clustering and Automated Calibration
Consensus clustering with subsampling and co-assignment matrices addresses stability and feature relevance by calibrating both the number of clusters and the regularization strength via an explicit, binomial-model-based score measuring the separation of within- and between-cluster co-membership proportions (Bodinier et al., 2023). Attribute weighting (using COSA or sparse hierarchical methods) enables robust focus on informative features even in high-dimensional settings, with calibration and clustering bundled in scalable pipelines (as in the R package sharp).
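A minimal sketch of the co-assignment (consensus) matrix construction over random subsamples is shown below; the calibration score and attribute-weighting machinery of Bodinier et al. (2023) are not reproduced, and the base clusterer and subsample fraction are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def coassignment_matrix(X, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of subsamples in which each pair of points is clustered together."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    counted = np.zeros((n, n))
    for r in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=5, random_state=r).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same      # co-clustered in this resample
        counted[np.ix_(idx, idx)] += 1.0        # jointly sampled in this resample
    return together / np.maximum(counted, 1.0)
```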
6. Advanced and Specialized Methodologies
6.1. Deep and Sequence Clustering
Recurrent deep divergence-based clustering (RDDC) (Trosten et al., 2018) facilitates joint feature learning and clustering of variable-length time series. RDDC processes each sequence through a bidirectional GRU, producing fixed-size embeddings with softmax cluster assignments, and optimizes a loss based on the Cauchy–Schwarz divergence, cluster orthogonality, and simplex compactness. This approach outperforms classical time-averaging or cropping methods, attaining perfect NMI on benchmark sequence datasets.
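A minimal PyTorch sketch of the encoder portion of such a model is given below: a bidirectional GRU produces a fixed-size embedding per sequence, followed by softmax cluster assignments. The hidden size is illustrative, and the Cauchy–Schwarz divergence, orthogonality, and compactness loss terms of RDDC are not reproduced here.

```python
import torch
import torch.nn as nn

class SeqClusterer(nn.Module):
    """Bidirectional GRU encoder with a softmax cluster-assignment head (losses omitted)."""
    def __init__(self, n_features, n_clusters, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_clusters)

    def forward(self, x):                       # x: (batch, time, n_features)
        _, h = self.gru(x)                      # h: (2, batch, hidden), one state per direction
        z = torch.cat([h[0], h[1]], dim=1)      # fixed-size embedding per sequence
        return torch.softmax(self.head(z), dim=1)   # soft cluster assignments
```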
6.2. Hierarchical and Multiset-Based Clustering in Feature Spaces
Agglomerative clustering based on the coincidence similarity index (Benatti et al., 2024), combining scaled Jaccard and overlap ("interiority") indices, extends traditional methods to proportional feature spaces, providing strong discrimination, invariance to scale, and robustness to outliers. This multiset-driven similarity simultaneously controls selectivity (via a sharpness exponent) and regularizes small-sample behavior (via an additive offset). Empirical comparisons on synthetic datasets show that coincidence-based linkage maintains high clustering accuracy and avoids spurious splits in both uniform and proportional spaces, outperforming standard distance-based linkages and Ward's method in proportional scenarios.
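The sketch below illustrates a coincidence-style similarity on nonnegative feature vectors, combining a real-valued Jaccard index with an interiority (overlap) index and sharpening the product with an exponent; the exact parameterization of Benatti et al. may differ, and the default exponent and offset values are assumptions.

```python
import numpy as np

def coincidence_similarity(x, y, exponent=3.0, offset=0.0):
    """Product of real-valued Jaccard and interiority indices, sharpened by an exponent."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    inter = np.minimum(x, y).sum()
    union = np.maximum(x, y).sum() + offset
    jaccard = inter / union if union > 0 else 0.0
    interiority = inter / (min(x.sum(), y.sum()) + offset + 1e-12)
    return (jaccard * interiority) ** exponent
```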
6.3. Clustering as Preprocessing for Symbolic Knowledge Extraction
Clustering serves as a precursor to rule extraction in high-dimensional, black-box regression applications (Sabbatini et al., 2022), where regions are first determined in the input space by deep clustering, then wrapped by minimal bounding hypercubes for interpretability. This framework addresses scalability and fidelity limitations of grid- or cube-expansion-only extractors and sets the stage for aligned, compact, and readable rule sets for knowledge distillation from opaque models.
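A minimal sketch of the hypercube-wrapping step is shown below: cluster the input space, then describe each cluster by per-feature lower and upper bounds, which read off directly as interval rules. The k-means clusterer here is a stand-in for the deep clustering used in the cited framework.

```python
import numpy as np
from sklearn.cluster import KMeans

def bounding_hypercubes(X, n_regions=3, seed=0):
    """Cluster the input space, then wrap each cluster in a minimal axis-aligned hypercube,
    readable as 'if l_j <= x_j <= u_j for all features j then region r' rules."""
    labels = KMeans(n_clusters=n_regions, n_init=10, random_state=seed).fit_predict(X)
    cubes = {}
    for r in np.unique(labels):
        pts = X[labels == r]
        cubes[r] = list(zip(pts.min(axis=0), pts.max(axis=0)))   # (lower, upper) per feature
    return cubes
```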
7. Evaluation Frameworks and Comparative Studies
Empirical evaluations of clustering methods (e.g., Rodriguez et al., 2016; Kapp-Joswig et al., 2022) stress the diversity of scenarios and the need for flexible, data-driven hyperparameter selection. Spectral clustering and model-based approaches frequently excel in higher dimensions or with subspace-structured clusters, while simple k-means remains effective in well-separated, isotropic scenarios but can be improved via initialization and gradient-based updates. Internal indices (silhouette, Calinski–Harabasz), information-theoretic, and bootstrapped quadratic scoring approaches serve as robust model selection criteria (Coraggio et al., 2021), addressing the need for objective selection and reliable partition validation across families.
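A minimal sketch of internal-index-based model selection follows: score candidate k-means solutions with the silhouette and Calinski–Harabasz indices and return the best-scoring number of clusters. The bootstrapped quadratic scoring of Coraggio et al. (2021) is not reproduced, and the candidate range is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def select_k(X, k_values=range(2, 11), seed=0):
    """Return the k maximizing the silhouette, reporting both internal indices per k."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = (silhouette_score(X, labels), calinski_harabasz_score(X, labels))
    best_k = max(scores, key=lambda k: scores[k][0])
    return best_k, scores
```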
Table: Algorithmic Families and Practical Properties
| Methodological Family | Key Features | Best-Use Scenarios |
|---|---|---|
| Prototype-based | Centroid/minimum-variance optimization | Spherical, well-separated clusters |
| Graph-/community-based | Modularity optimization, automatic selection of $K$ | Non-convex clusters, unknown $K$ |
| Density-based | Arbitrary shapes, noise handling | Spatial/irregular data |
| Model-based (probabilistic) | Mixture density, soft membership | Overlapping/structured data |
| Deep/sequence-based | End-to-end joint embedding/clustering | Variable-length/time series |
| Multiset/coincidence-based | Scale-invariant, strict discrimination | Ratio/proportional feature data |
Conclusion
Clustering-based approaches have diversified into a spectrum of rigorously structured methodologies capable of handling data types ranging from Euclidean vectors to sets, curves, time series, and large unstructured corpora. The field has shifted from reliance on heuristic assignment and simple variance minimization to adaptive, penalized, probabilistic, consensus-driven, semi-supervised, and high-dimensionality-resilient techniques. Current research emphasizes initialization-insensitive, automatically calibrated, and structure-preserving methods, as well as clustering as a foundational module for interpretability and knowledge extraction. The breadth of algorithmic strategies available enables practitioners to tailor clustering to the statistical and operational properties of their specific data and downstream analytic requirements.