Variance-Regularized K-means (VR-Kmeans)
- The algorithm's main contribution is integrating iterative outlier removal into the K-means loop, reducing intra-cluster variance below what standard K-means achieves.
- It employs a Chebyshev-type threshold for anomaly detection, ensuring robust and systematic exclusion of variance-inflating outliers.
- Empirical evaluations show up to 88.1% variance reduction and significant improvements in clustering metrics on both synthetic and real-world datasets.
Variance-Regularized K-means (VR-Kmeans) is an enhanced clustering algorithm that augments the standard Lloyd-style K-means procedure with explicit intra-cluster variance reduction and integrated anomaly detection via iterative outlier pruning. The method systematically refines clusters by alternating between reassignment, statistical outlier exclusion based on within-cluster distances, and centroid updating, with the objective of minimizing the mean intra-cluster variance beyond that achievable by conventional K-means. Outliers are detected as points whose assignment causes a significant increase in cluster variance as determined by a Chebyshev-type threshold, and are permanently removed from the clustering process. Empirical evaluations on synthetic and real-world datasets demonstrate significant reductions in intra-cluster variance, improvements in unsupervised and supervised clustering quality metrics, and robust identification of anomalies (Shorewala et al., 30 May 2025).
1. Mathematical Formulation and Objective
Let $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ be the input data with $n$ samples in $d$ dimensions, and let $C = \{c_1, \dots, c_k\}$ denote the set of $k$ cluster centroids. The algorithm employs the Euclidean ($\ell_2$) distance for all clustering assignments.
After assignment, let $S_j$ be the set of points belonging to cluster $j$ with $n_j = |S_j|$. The sample variance within each cluster is
$$\sigma_j^2 = \frac{1}{n_j} \sum_{x \in S_j} \lVert x - c_j \rVert_2^2.$$
The global average intra-cluster variance is
$$\bar{\sigma}^2 = \frac{1}{k} \sum_{j=1}^{k} \sigma_j^2.$$
The objective of VR-Kmeans is to iteratively minimize $\bar{\sigma}^2$ by alternating clustering, outlier removal, and centroid updates until stabilization.
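As a concrete reference for the quantities above, the mean intra-cluster variance can be computed directly from the definitions; this NumPy sketch (function name ours) takes the data, an assignment, and the centroids:

```python
import numpy as np

def mean_intra_cluster_variance(X, labels, centroids):
    """Average over clusters j of (1/n_j) * sum of squared L2
    distances from each member point to its centroid c_j."""
    variances = []
    for j, c in enumerate(centroids):
        members = X[labels == j]
        if len(members) == 0:
            continue  # empty clusters contribute no variance term
        variances.append(np.mean(np.sum((members - c) ** 2, axis=1)))
    return float(np.mean(variances))
```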
2. Algorithmic Procedure
The core VR-Kmeans procedure consists of repeated cycles that alternate the following steps:
- Initialization: Centroids are initialized using standard K-means.
- Assignment: Each $x_i \in X \setminus A$ is assigned to its closest centroid $c_j$, where $A$ is the set of removed anomalies.
- Variance Computation: For each cluster $j$, the array of distances $\lVert x_i - c_j \rVert_2$ from points in the cluster to centroid $c_j$ is computed. The mean $\mu_j$ and standard deviation $s_j$ of these distances are evaluated.
- Outlier Removal: Using the Chebyshev-type threshold $\mu_j + \lambda s_j$ for a fixed multiplier $\lambda$, all $x_i$ with $\lVert x_i - c_j \rVert_2 > \mu_j + \lambda s_j$ are permanently removed from the cluster and added to $A$.
- Centroid Update: For each cluster $j$, the centroid is updated as $c_j = \frac{1}{n_j} \sum_{x \in S_j} x$.
- Variance Update and Convergence: The new $\bar{\sigma}^2$ is computed. Iterations continue until $|\bar{\sigma}^2_{(t)} - \bar{\sigma}^2_{(t-1)}| < \epsilon$, a small pre-specified tolerance.
Upon termination, the returned outputs are the final centroids $C$ and the set of anomalies $A$.
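The cycle above can be sketched in NumPy. This is a minimal illustration, not the reference implementation: it uses a plain random initialization instead of the paper's K-means warm start, and `lam`, `tol`, and the small-spread guard are our illustrative choices rather than the paper's notation.

```python
import numpy as np

def vr_kmeans(X, k, init=None, lam=3.0, tol=1e-6, max_iter=100, seed=0):
    """Sketch of the VR-Kmeans loop: assignment, Chebyshev-style
    pruning, centroid update, repeated until the mean intra-cluster
    variance stabilizes."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = (X[rng.choice(len(X), k, replace=False)].copy()
                 if init is None else np.array(init, dtype=float))
    active = np.ones(len(X), dtype=bool)   # True = not pruned as anomaly
    prev_var = np.inf
    for _ in range(max_iter):
        Xa = X[active]
        # Assignment: nearest centroid under the Euclidean distance.
        d = np.linalg.norm(Xa[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        dist = d[np.arange(len(Xa)), labels]
        # Outlier removal: prune points whose distance exceeds
        # mean + lam * std of their cluster's distance distribution.
        keep = np.ones(len(Xa), dtype=bool)
        for j in range(k):
            in_j = labels == j
            if in_j.sum() < 2:
                continue
            mu, s = dist[in_j].mean(), dist[in_j].std()
            if s < 1e-12:                  # no spread -> nothing to prune
                continue
            keep &= ~(in_j & (dist > mu + lam * s))
        active_idx = np.flatnonzero(active)
        active[active_idx[~keep]] = False  # permanent removal
        Xa, labels = Xa[keep], labels[keep]
        # Centroid update and mean intra-cluster variance.
        var = 0.0
        for j in range(k):
            members = Xa[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
                var += np.mean(np.sum((members - centroids[j]) ** 2, axis=1))
        var /= k
        if abs(prev_var - var) < tol:      # convergence on variance change
            break
        prev_var = var
    return centroids, np.flatnonzero(~active)
```

On a toy dataset with two tight clusters and one distant point, the distant point lands in the anomaly set and the returned centroids match the cluster means of the remaining points.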
3. Hyper-parameters and Statistical Guarantees
VR-Kmeans uses the following hyper-parameters; only the outlier bound $\lambda$ is new relative to standard K-means:
- $k$: Number of clusters (identical to K-means).
- $\lambda$: Outlier removal bound, specifying the number of standard deviations above the mean distance beyond which points are considered anomalies. By Chebyshev’s inequality, at most $1/\lambda^2$ of the data in each cluster are removed per iteration, guaranteeing that at least $1 - 1/\lambda^2$ of each cluster's points are retained when $\lambda > 1$. Larger values of $\lambda$ reduce the aggressiveness of outlier pruning.
- $\epsilon$: Convergence threshold for the change in $\bar{\sigma}^2$.
- Distance Metric: Fixed to $\ell_2$ (Euclidean), though, in principle, any Minkowski-type distance may be used.
The use of Chebyshev’s inequality formalizes the robustness of the anomaly threshold, preventing excessive data removal during each iteration.
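The retention guarantee can be checked numerically: applying Markov's inequality to squared deviations shows that, for any sample, at most $1/\lambda^2$ of the points can lie more than $\lambda$ sample standard deviations from the sample mean. A quick sketch (the heavy-tailed test distribution is our choice):

```python
import numpy as np

rng = np.random.default_rng(42)
# Heavy-tailed stand-in for within-cluster distance arrays.
d = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
for lam in (1.5, 2.0, 3.0):
    pruned = np.mean(d > d.mean() + lam * d.std())
    # Chebyshev/Markov bound: at most 1/lam^2 of the sample is pruned.
    assert pruned <= 1.0 / lam**2
    print(f"lam={lam}: pruned fraction {pruned:.4f} <= {1 / lam**2:.4f}")
```

The bound holds deterministically for sample statistics (with the biased standard deviation, as NumPy's default `std` computes), not merely in expectation.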
4. Anomaly Detection Mechanism
Anomaly detection is inherent to the VR-Kmeans algorithm. After each assignment step, any $x_i$ such that $\lVert x_i - c_j \rVert_2 > \mu_j + \lambda s_j$ within its assigned cluster $j$ is flagged as an outlier. These points are immediately and permanently excluded from further clustering and are aggregated into the anomaly set $A$. This procedure detects both local (relative to cluster) and global outliers and does not require prior labeling. The anomaly score is implicitly determined by deviation from the cluster mean distance rather than any global distance criterion.
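The implicit score can be made explicit as a within-cluster distance z-score; this helper (name and interface ours, not from the paper) returns how many cluster standard deviations each point lies above its cluster's mean distance, so that points scoring above the pruning multiplier would be flagged:

```python
import numpy as np

def anomaly_scores(X, labels, centroids):
    """Within-cluster z-score of each point's distance to its
    assigned centroid; higher means more anomalous."""
    scores = np.zeros(len(X))
    for j, c in enumerate(centroids):
        in_j = labels == j
        if in_j.sum() < 2:
            continue  # too few points to estimate spread
        d = np.linalg.norm(X[in_j] - c, axis=1)
        s = d.std()
        if s > 0:
            scores[in_j] = (d - d.mean()) / s
    return scores
```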
5. Empirical Performance and Evaluation Metrics
Empirical results on both synthetic and real (UCI) datasets demonstrate sharp improvements in variance-reduction and clustering efficacy over standard K-means (Shorewala et al., 30 May 2025).
| Dataset | Outliers Removed | Variance Reduction | Davies-Bouldin Δ | Silhouette Δ | Calinski-Harabasz Δ | Accuracy Δ | F1 Score Δ | Jaccard Δ | V-measure Δ |
|---|---|---|---|---|---|---|---|---|---|
| Synthetic 2D | 7.5% | ↓18.7% | ↓13.9% | ↑9.44% | ↑6.95% | — | — | — | — |
| UCI WBC | 10.1% | ↓57.9% | ↓11.5% | ↑11.0% | ↑31.7% | ↑1.95% | ↑1.73% | ↑3.34% | ↑9.95% |
| UCI Wine Quality | 7.6% | ↓88.1% | ↓12.6% | ↑8.2% | ↑39.4% | ↑22.5% | ↑20.8% | ↑22.5% | ↑78.6% |
Key:
- Intrinsic metrics: variance, Davies-Bouldin, Silhouette, Calinski-Harabasz
- Extrinsic metrics: accuracy, F1, Jaccard, V-measure (where true labels permit comparison)
Across settings, VR-Kmeans achieves variance drops between 18.7% and 88.1%, with corresponding substantial improvements in cluster quality (Silhouette increase up to 11.0%; Calinski-Harabasz increase up to 39.4%; Davies–Bouldin reduced by up to 13.9%). When external labels are available, accuracy and F1 score improvements reach 22.5% and 20.8% respectively.
6. Convergence and Computational Complexity
Convergence of VR-Kmeans is ensured empirically by the monotonic reduction of $\bar{\sigma}^2$ until the incremental improvement falls below the threshold $\epsilon$, that is, $|\bar{\sigma}^2_{(t)} - \bar{\sigma}^2_{(t-1)}| < \epsilon$.
No formal proof of achieving the global minimum of average variance is provided. Empirically, stabilization is typically reached within a few dozen iterations.
Each iteration matches the computational cost per iteration of standard K-means:
- Assignment step: $O(nkd)$
- Centroid recomputation: $O(nd)$
- Mean/std computation for each cluster: $O(n)$ overall, since the distances are already available from the assignment step
Overall runtime is $O(Tnkd)$ for $T$ iterations (typically a few dozen). If intrinsic metrics such as the Silhouette coefficient are computed at each stage, additional overhead may result.
7. Comparative Advantages and Summary
VR-Kmeans achieves a reduction in intra-cluster variance beyond the Lloyd-style local optimum of vanilla K-means by systematic “pruning” of variance-inflating outliers. It maintains full compatibility with the K-means workflow but adds only the single hyper-parameter $\lambda$ to control anomaly sensitivity. The algorithm’s built-in anomaly detection returns an explicit set of outliers without recourse to external modules or priors. Improvements in both unsupervised metrics (variance, internal indices) and supervised ones (label agreement) are consistently observed on synthetic and real datasets. The approach is a direct extension of K-means, readily adaptable for robust clustering and integrated anomaly detection applications (Shorewala et al., 30 May 2025).