ECA*: Evolutionary Clustering Algorithm Star
- ECA* is an ensemble metaheuristic that combines BSA-driven search, Lévy-flight mutations, and quartile-based pruning to achieve robust clustering of heterogeneous data.
- It employs social-class ranking to guide candidate solutions, reducing sensitivity to dataset variations and accurately identifying the true number of clusters.
- Empirical evaluations on 32 benchmark datasets demonstrate lower SSE and higher NMI, underscoring ECA*'s superior performance over conventional clustering methods.
The Evolutionary Clustering Algorithm Star (ECA*) is an ensemble metaheuristic designed for robust clustering of heterogeneous datasets, integrating evolutionary optimization with explicit mechanisms for exploration, exploitation, and cluster validity. ECA* is constructed atop the Backtracking Search Optimization Algorithm (BSA) and augments the search with social-class ranking, Lévy-flight mutations, quartile-based outlier pruning, and Euclidean assignment. Empirical evaluation demonstrates superior performance in identifying the true number of clusters and low sensitivity to varied dataset features compared with contemporary clustering algorithms (Hassan et al., 2021).
1. Algorithmic Structure and Evolutionary Components
ECA* encodes each candidate solution as a set of cluster centroids in $d$-dimensional space, represented as $C = \{c_1, c_2, \ldots, c_k\}$ with $c_i \in \mathbb{R}^d$. The population is initialized with random centroids sampled uniformly within the data range. For each generation, fitness is calculated via the within-cluster sum of squared errors (SSE), while additional metrics, normalized MSE (nMSE) and an approximation ratio relative to the known optimum, are tracked for validation.
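The SSE fitness is the standard within-cluster sum of squared errors; a minimal NumPy sketch (function and variable names are illustrative, not taken from the reference implementation):

```python
import numpy as np

def sse(points, centroids):
    """Within-cluster sum of squared errors: each point is charged the
    squared Euclidean distance to its nearest centroid."""
    # pairwise squared distances, shape (n_points, n_centroids)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
cents = np.array([[0.0, 0.5], [10.0, 10.0]])
print(sse(pts, cents))  # 0.25 + 0.25 + 0.0 = 0.5
```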
Selection is governed by social-class ranking: the population is divided into “high class” and “low class” based on SSE fitness. High-class individuals guide the generation of trial solutions for low-class members, promoting elitist bias without stagnation.
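One way to realize the high/low split is to sort candidates by SSE and cut at a fixed elite fraction; the sketch below assumes a 50% split for illustration, since the exact ratio is not stated here:

```python
import numpy as np

def split_classes(population, fitness, high_fraction=0.5):
    """Rank candidates by SSE (lower is better) and split the population
    into a high class (elite guides) and a low class (targets of variation).
    high_fraction is an assumed, illustrative parameter."""
    order = np.argsort(fitness)                      # best-first ranking
    cut = max(1, int(len(population) * high_fraction))
    return population[order[:cut]], population[order[cut:]]

pop = np.arange(12.0).reshape(6, 2)                  # six 2-D candidates
fit = np.array([3.0, 1.0, 2.0, 6.0, 5.0, 4.0])      # SSE per candidate
high, low = split_classes(pop, fit)
```

High-class members then serve as guides when generating trial solutions for the low class.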
Variation leverages both standard BSA differential operations and Lévy-flight perturbations. The evolutionary update for an individual $x_i$ is

$$x_i' = x_i + F \cdot (x_{r_1} - x_{r_2}),$$

where $F$ is a scaling factor and indices $r_1$, $r_2$ are drawn randomly from the population. For a designated fraction of poorly performing individuals, Lévy-flight-based mutations are invoked,

$$x_i' = x_i + \alpha \oplus \mathrm{L\acute{e}vy}(\beta),$$

to inject heavy-tailed stochastic variation.
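Heavy-tailed Lévy steps are commonly generated with Mantegna's algorithm; the sketch below assumes that construction and illustrative defaults for the step scale and Lévy index (the paper's settings are not reproduced here):

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    """Heavy-tailed random step via Mantegna's algorithm.
    beta (the Lévy index) defaults to 1.5, an illustrative choice."""
    rng = rng if rng is not None else np.random.default_rng()
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def levy_mutate(x, alpha=0.01, beta=1.5, rng=None):
    """Perturb a centroid vector with a scaled Lévy-distributed step;
    alpha is an assumed step scale, not the paper's setting."""
    return x + alpha * levy_step(x.size, beta, rng)
```

Most steps are small, but occasional large jumps let stagnating candidates escape local optima.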
Quartile-based pruning eliminates centroids whose distance scores exceed the upper quartile fence ($Q_3 + 1.5\,\mathrm{IQR}$), suppressing spurious clusters. Environmental selection retains the top candidates (lowest SSE) for the next generation. The key parameters (population size, scaling factor, Lévy-mutation fraction, and pruning fence) follow the settings reported in the original study.
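The pruning step can be sketched as a Tukey-fence filter over per-centroid distance scores; the mean-distance score and the conventional 1.5·IQR fence are assumptions made for illustration:

```python
import numpy as np

def prune_centroids(points, centroids, k=1.5):
    """Drop centroids whose mean distance to their assigned points is an
    upper-quartile outlier (Tukey fence Q3 + k*IQR). The mean-distance
    score and k=1.5 are illustrative assumptions."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the closest centroid per point
    scores = np.array([
        np.sqrt(d2[nearest == j, j]).mean() if (nearest == j).any() else np.inf
        for j in range(len(centroids))  # inf marks an empty (spurious) centroid
    ])
    q1, q3 = np.percentile(scores[np.isfinite(scores)], [25, 75])
    return centroids[scores <= q3 + k * (q3 - q1)]

pts = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
cents = np.array([[0.0, 0.0], [10.0, 10.0], [100.0, 100.0]])
print(len(prune_centroids(pts, cents)))  # the empty far-away centroid is pruned -> 2
```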
2. Objective Functions and Cluster Validity Measures
ECA* optimizes internal objective functions that require no ground truth, namely SSE and nMSE, with an approximation ratio employed when the theoretical optimum is available. For external validation, the following indices are computed:
- Centroid Index (CI): quantifies missing/extra centroids, facilitating cluster enumeration accuracy.
- Centroid Similarity Index (CSI): measures the proportion of matched cluster assignments.
- Normalized Mutual Information (NMI): quantifies the agreement between the found partition $U$ and a reference partition $V$:

$$\mathrm{NMI}(U, V) = \frac{2\, I(U; V)}{H(U) + H(V)},$$

where $I(U;V)$ is the mutual information and $H(\cdot)$ the entropy of a partition.
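NMI can be computed directly from the contingency table of the two labelings; a self-contained sketch (variable names are illustrative):

```python
import numpy as np
from math import log

def nmi(u, v):
    """Normalized mutual information between two labelings:
    NMI = 2 I(U;V) / (H(U) + H(V)), from the contingency table."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    cont = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(cont, (ui, vi), 1)                 # joint label counts
    pu, pv, pij = cont.sum(1) / n, cont.sum(0) / n, cont / n
    h = lambda p: -sum(x * log(x) for x in p if x > 0)
    mi = sum(pij[i, j] * log(pij[i, j] / (pu[i] * pv[j]))
             for i in range(len(pu)) for j in range(len(pv)) if pij[i, j] > 0)
    return 2 * mi / (h(pu) + h(pv))

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical up to relabeling -> 1.0
```

Note that NMI is invariant to label permutation, which is why the relabeled partition above still scores 1.0.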
3. Experimental Methodology
ECA* and five reference algorithms—GENCLUST++ (GA + K-means hybrid), LVQ (Learning Vector Quantization), EM (Expectation Maximization for Gaussian Mixtures), K-means++, and classical K-means—were compared on 32 benchmark datasets. These datasets encompass synthetic sets (S1–S4), real-world data (A1–A3), classical shape benchmarks (Birch1, Unbalance, Aggregation, Compound, Pathbased, D31, R15, Jain, Flame), high-dimensional sets (dim032 to dim1024), and Gaussian clouds with variable overlap/dimensionality (g2-16-10 to g2-1024-100). Each algorithm was run 30 times per dataset using Java for ECA* and Weka for competitors.
Five prominent dataset features characterized the experimental landscape:
- Number of clusters
- Cluster dimensionality
- Cluster overlap
- Cluster shape
- Cluster structure
Validation comprised both internal (SSE, nMSE, approximation ratio) and external (CI, CSI, NMI) metrics.
4. Empirical Results and Sensitivity Analysis
Aggregate results—sampled from four datasets (S1, Dim-128, g2-16-60, Flame)—are summarized:
| Dataset | Alg. | SSE | nMSE | NMI | CI |
|---|---|---|---|---|---|
| S1 | ECA* | 1.02×10¹³ | 1.02×10⁹ | 0.986 | 0.01 |
| Dim-128 | ECA* | 2.32×10¹¹ | 7.26×10⁶ | 0.998 | 0.00 |
| g2-16-60 | ECA* | 1.17×10¹¹ | 3.56×10³ | 0.997 | 0.00 |
| Flame | ECA* | 3.19×10⁶ | 6.65×10⁰ | 0.919 | 0.00 |
ECA* outperformed alternatives regarding cluster recovery (CI ≈ 0), SSE, nMSE, and NMI across the majority of datasets.
Sensitivity analysis was performed via a "performance rating" framework: algorithms were ranked (1–6) for each dataset feature and validation measure, then averaged into grand mean ranks. By this measure, ECA* is consistently the least sensitive to variations in shape, overlap, dimensionality, structure, and cardinality. EM and KM++ follow, with LVQ and GENCLUST++ exhibiting the least robustness.
5. Strengths, Limitations, and Prospective Directions
ECA* demonstrates strength in discovering the correct number of clusters (low CI), achieving high internal and external validation scores, and maintaining stability against dataset heterogeneity. The algorithm design, combining BSA-driven search, Lévy exploration, and quartile-based pruning, delivers robust exploration–exploitation balance.
Notable limitations include reliance on prior knowledge (e.g., search-space bounds for centroid initialization), limited real-world scalability (benchmarks of moderate size), and user-dependent hyperparameter configuration.
Future research directions include developing fully automatic or self-adaptive parameter mechanisms, deploying ECA* in large-scale applications (text, bioinformatics), exploring integration with deep learning for semi-supervised clustering, and investigating the theoretical convergence behavior of the BSA–Lévy–pruning approach.
6. Relation to Evolutionary and Hybrid Clustering Literature
ECA* inherits its metaheuristic backbone from the Backtracking Search Optimization Algorithm (BSA). Unlike traditional clustering algorithms (KM, KM++, EM), ECA* integrates population-based optimization, social elitism, heavy-tailed exploration, and cluster quality pruning. The comparative inclusion of GENCLUST++ (GA/K-means hybrid) and LVQ situates ECA* within the broader ensemble and evolutionary clustering domain, highlighting its distinctive mechanism for reducing sensitivity and enhancing cluster enumeration accuracy.
A plausible implication is that ECA* serves as a template for ensemble metaheuristics that are resilient to dataset idiosyncrasies, providing comparative robustness not obtained via classical or simplistic evolutionary approaches.
7. Implementation and Benchmarking Practices
ECA* was implemented in Java with explicit parameter settings. Benchmarks used 30 independent runs per dataset for reliability, with all reference algorithms executed under identical conditions via the Weka framework. Validation adhered to established internal and external measures.
This rigorous methodology substantiates ECA*’s empirical superiority and paves the way for benchmark-driven development in clustering algorithm research. Further applications might necessitate adaptation for unsupervised contexts where prior knowledge is unavailable; prospective algorithmic enhancements may focus on this regime.