
Variance-Regularized K-means (VR-Kmeans)

Updated 17 February 2026
  • The algorithm's main contribution is integrating iterative outlier removal to minimize intra-cluster variance beyond standard K-means.
  • It employs a Chebyshev-type threshold for anomaly detection, ensuring robust and systematic exclusion of variance-inflating outliers.
  • Empirical evaluations show up to 88.1% variance reduction and significant improvements in clustering metrics on both synthetic and real-world datasets.

Variance-Regularized K-means (VR-Kmeans) is an enhanced clustering algorithm that augments the standard Lloyd-style K-means procedure with explicit intra-cluster variance reduction and integrated anomaly detection via iterative outlier pruning. The method systematically refines clusters by alternating between reassignment, statistical outlier exclusion based on within-cluster distances, and centroid updating, with the objective of minimizing the mean intra-cluster variance beyond that achievable by conventional K-means. Outliers are detected as points whose assignment causes a significant increase in cluster variance as determined by a Chebyshev-type threshold, and are permanently removed from the clustering process. Empirical evaluations on synthetic and real-world datasets demonstrate significant reductions in intra-cluster variance, improvements in unsupervised and supervised clustering quality metrics, and robust identification of anomalies (Shorewala et al., 30 May 2025).

1. Mathematical Formulation and Objective

Let $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^d$ be the input data with $N$ samples in $d$ dimensions, and let $C = \{c_1, \ldots, c_c\}$ denote the set of $c$ cluster centroids. The algorithm employs the Euclidean ($\ell_2$) distance $d(x, y) = \|x - y\|_2$ for all clustering assignments.

After assignment, let $D_j = \{x \in X : \arg\min_k d(x, c_k) = j\}$ be the set of points belonging to cluster $j$, with $|D_j| = m_j$. The sample variance within each cluster is

$$V_j = \frac{1}{m_j - 1} \sum_{x \in D_j} \|x - c_j\|_2^2.$$

The global average intra-cluster variance is

$$V_{\rm avg} = \frac{1}{c} \sum_{j=1}^{c} V_j = \frac{1}{c} \sum_{j=1}^{c} \frac{1}{m_j - 1} \sum_{x \in D_j} \|x - c_j\|_2^2.$$

The objective of VR-Kmeans is to iteratively minimize $V_{\rm avg}$ by alternating clustering, outlier removal, and centroid updates until stabilization.
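As a concrete illustration, the per-cluster sample variance $V_j$ and the global average $V_{\rm avg}$ can be computed as follows (a minimal NumPy sketch; function and variable names are illustrative, not from the paper):

```python
import numpy as np

def intra_cluster_variances(X, labels, centroids):
    """Sample variance V_j of each cluster and their average V_avg."""
    variances = []
    for j, c_j in enumerate(centroids):
        members = X[labels == j]
        m_j = len(members)
        if m_j < 2:  # V_j is undefined for fewer than two points
            variances.append(0.0)
            continue
        sq_dists = np.sum((members - c_j) ** 2, axis=1)
        variances.append(sq_dists.sum() / (m_j - 1))  # 1/(m_j - 1) normalization
    return np.array(variances), float(np.mean(variances))
```

Note the $1/(m_j - 1)$ (Bessel-corrected) normalization, matching the formula above rather than the biased $1/m_j$ variant.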

2. Algorithmic Procedure

The core VR-Kmeans procedure consists of repeated cycles that alternate the following steps:

  1. Initialization: Centroids $C$ are initialized using standard K-means.
  2. Assignment: Each $x_i \in X \setminus Y$ is assigned to its closest centroid, where $Y$ is the set of removed anomalies.
  3. Variance Computation: For each cluster $j$, the array of distances $D_j^{(t)}$ from points in the cluster to centroid $c_j$ is computed. The mean $\mu_j$ and standard deviation $\sigma_j$ of these distances are evaluated.
  4. Outlier Removal: Using the Chebyshev-type threshold $\tau_j = \mu_j + k_0 \sigma_j$ (default $k_0 = 2$), all $x \in D_j$ with $d(x, c_j) \geq \tau_j$ are permanently removed from the cluster and added to $Y$.
  5. Centroid Update: For each $j$, the centroid is updated as $c_j = \frac{1}{|D_j|} \sum_{x \in D_j} x$.
  6. Variance Update and Convergence: The new $V_{\rm avg}^{(t)}$ is computed. Iterations continue until $|V_{\rm avg}^{(t)} - V_{\rm avg}^{(t-1)}| < \epsilon$, a small pre-specified tolerance.

Upon termination, the returned outputs are the final centroids $C$ and the set of anomalies $Y$.
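The cycle above can be sketched in code. The following is a minimal NumPy implementation of the described procedure, not the authors' code: names are my own, and for brevity the initialization step accepts explicit seed centroids (or random samples from the data) in place of a full K-means initialization.

```python
import numpy as np

def vr_kmeans(X, c, k0=2.0, eps=1e-6, max_iter=50, init=None, seed=0):
    """VR-Kmeans sketch: alternate assignment, Chebyshev-type
    outlier pruning, and centroid updates until V_avg stabilizes."""
    rng = np.random.default_rng(seed)
    if init is None:  # stand-in for the paper's K-means initialization
        centroids = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    else:
        centroids = np.array(init, dtype=float)
    active = np.ones(len(X), dtype=bool)  # points not yet flagged as anomalies
    prev_v = np.inf
    for _ in range(max_iter):
        pts = X[active]
        # Assignment: nearest centroid under the l2 distance
        d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        dist = d[np.arange(len(pts)), labels]
        # Outlier removal: d(x, c_j) >= mu_j + k0 * sigma_j within each cluster
        keep = np.ones(len(pts), dtype=bool)
        for j in range(c):
            in_j = labels == j
            if in_j.sum() < 2:
                continue
            mu, sigma = dist[in_j].mean(), dist[in_j].std()
            keep &= ~(in_j & (dist >= mu + k0 * sigma))
        idx = np.flatnonzero(active)
        active[idx[~keep]] = False  # permanent removal into the anomaly set
        pts, labels = pts[keep], labels[keep]
        # Centroid update and convergence check on V_avg
        variances = []
        for j in range(c):
            in_j = labels == j
            if in_j.sum() >= 2:
                centroids[j] = pts[in_j].mean(axis=0)
                sq = np.sum((pts[in_j] - centroids[j]) ** 2, axis=1)
                variances.append(sq.sum() / (in_j.sum() - 1))
        v_avg = float(np.mean(variances)) if variances else 0.0
        if abs(prev_v - v_avg) < eps:
            break
        prev_v = v_avg
    return centroids, X[~active]
```

On two well-separated Gaussian blobs plus a distant point, the distant point's distance exceeds its cluster's $\mu_j + 2\sigma_j$ threshold on the first pass and lands in the returned anomaly set.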

3. Hyper-parameters and Statistical Guarantees

VR-Kmeans introduces a single new hyper-parameter, $k_0$, on top of those shared with standard K-means:

  • $c$: Number of clusters (identical to K-means).
  • $k_0$: Outlier removal bound, specifying the number of standard deviations above the mean distance beyond which points are considered anomalies (default $k_0 = 2$). By Chebyshev's inequality, at most $1/k_0^2$ of the data in each cluster are removed per iteration, guaranteeing that at least $75\%$ of each cluster's points are retained when $k_0 = 2$. Larger $k_0$ values reduce the aggressiveness of outlier pruning.
  • $\epsilon$: Convergence threshold for the change in $V_{\rm avg}$.
  • Distance metric: Fixed to $\ell_2$, though in principle any Minkowski-type distance may be used.

The use of Chebyshev’s inequality formalizes the robustness of the anomaly threshold, preventing excessive data removal during each iteration.
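The empirical (sample) form of Chebyshev's inequality is easy to check numerically: for any distance sample with $\sigma > 0$, at most $1/k_0^2$ of the values lie at or above $\mu + k_0\sigma$. A small sketch (function name is illustrative):

```python
import numpy as np

def retained_fraction(dists, k0=2.0):
    """Fraction of points kept after one Chebyshev-type pruning pass."""
    mu, sigma = dists.mean(), dists.std()  # ddof=0: empirical distribution
    return float(np.mean(dists < mu + k0 * sigma))
```

Even for a heavy-tailed sample (e.g. Pareto-distributed distances), at least $75\%$ of the points survive a single pass with $k_0 = 2$.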

4. Anomaly Detection Mechanism

Anomaly detection is inherent to the VR-Kmeans algorithm. After each assignment step, any $x$ with $d(x, c_j) \geq \mu_j + k_0 \sigma_j$ within its assigned cluster $j$ is flagged as an outlier. These points are immediately and permanently excluded from further clustering and are aggregated into the anomaly set $Y$. This procedure detects both local (relative to cluster) and global outliers and does not require prior labeling. The anomaly score is implicitly determined by deviation from the cluster's mean distance rather than any global distance criterion.

5. Empirical Performance and Evaluation Metrics

Empirical results on both synthetic and real (UCI) datasets demonstrate sharp improvements in variance reduction and clustering efficacy over standard K-means (Shorewala et al., 30 May 2025).

| Dataset | Outliers Removed | Variance Reduction | Davies-Bouldin Δ | Silhouette Δ | Calinski-Harabasz Δ | Accuracy Δ | F1 Score Δ | Jaccard Δ | V-measure Δ |
|---|---|---|---|---|---|---|---|---|---|
| Synthetic 2D | 7.5% | ↓18.7% | ↓13.9% | ↑9.44% | ↑6.95% | — | — | — | — |
| UCI WBC | 10.1% | ↓57.9% | ↓11.5% | ↑11.0% | ↑31.7% | ↑1.95% | ↑1.73% | ↑3.34% | ↑9.95% |
| UCI Wine Quality | 7.6% | ↓88.1% | ↓12.6% | ↑8.2% | ↑39.4% | ↑22.5% | ↑20.8% | ↑22.5% | ↑78.6% |

Key:

  • Intrinsic metrics: variance, Davies-Bouldin, Silhouette, Calinski-Harabasz
  • Extrinsic metrics: accuracy, F1, Jaccard, V-measure (where true labels permit comparison)

Across settings, VR-Kmeans achieves variance drops between $18.7\%$ and $88.1\%$, with corresponding substantial improvements in cluster quality (Silhouette increase up to $11.0\%$; Calinski-Harabasz increase up to $39.4\%$; Davies-Bouldin reduced by up to $13.9\%$). When external labels are available, accuracy and F1 score improvements reach $22.5\%$ and $20.8\%$, respectively.
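The intrinsic indices used above are standard and available in scikit-learn. A minimal sketch of how such an evaluation might be run (assuming scikit-learn is installed; this is not the paper's evaluation code, and the synthetic data here is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def intrinsic_report(X, labels):
    """Intrinsic clustering indices from the paper's evaluation protocol."""
    return {
        "silhouette": silhouette_score(X, labels),               # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),       # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels), # higher is better
    }

# Two well-separated Gaussian blobs as a toy benchmark
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(intrinsic_report(X, labels))
```

Extrinsic metrics (accuracy, F1, Jaccard, V-measure) additionally require ground-truth labels and, for the first three, a matching of cluster IDs to classes.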

6. Convergence and Computational Complexity

Convergence of VR-Kmeans is ensured empirically by the monotonic reduction of $V_{\rm avg}$ until the incremental improvement falls below the threshold $\epsilon$, that is,

$$\lim_{t \to \infty} |V_{\rm avg}^{(t)} - V_{\rm avg}^{(t-1)}| = 0.$$

No formal proof of achieving the global minimum of average variance is provided. Empirically, stabilization is typically reached within a few dozen iterations.

Each iteration matches the per-iteration computational cost of standard K-means:

  • Assignment step: $\mathcal{O}(Ncd)$
  • Centroid recomputation: $\mathcal{O}(Nd)$
  • Mean/std computation for each cluster: $\mathcal{O}(N)$

Overall runtime is $T \cdot \mathcal{O}(Ncd)$ for $T$ iterations (typically $T < 50$). If intrinsic metrics such as the Silhouette coefficient are computed at each stage, additional $\mathcal{O}(N^2)$ overhead may result.

7. Comparative Advantages and Summary

VR-Kmeans achieves a reduction in intra-cluster variance beyond the Lloyd-style local optimum of vanilla K-means by systematic “pruning” of variance-inflating outliers. It maintains full compatibility with the K-means workflow but adds only the single hyper-parameter $k_0$ to control anomaly sensitivity. The algorithm's built-in anomaly detection returns an explicit set of outliers without recourse to external modules or priors. Improvements in both unsupervised metrics (variance, internal indices) and supervised ones (label agreement) are consistently observed on synthetic and real datasets. The approach is a direct extension of K-means, readily adaptable for robust clustering and integrated anomaly detection applications (Shorewala et al., 30 May 2025).
