Variance-Regularized K-means (VR-Kmeans)
- The algorithm's main contribution is integrating iterative outlier removal into the K-means loop, reducing intra-cluster variance below what standard K-means achieves.
- It employs a Chebyshev-type threshold for anomaly detection, ensuring robust and systematic exclusion of variance-inflating outliers.
- Empirical evaluations show up to 88.1% variance reduction and significant improvements in clustering metrics on both synthetic and real-world datasets.
Variance-Regularized K-means (VR-Kmeans) is an enhanced clustering algorithm that augments the standard Lloyd-style K-means procedure with explicit intra-cluster variance reduction and integrated anomaly detection via iterative outlier pruning. The method systematically refines clusters by alternating between reassignment, statistical outlier exclusion based on within-cluster distances, and centroid updating, with the objective of minimizing the mean intra-cluster variance beyond that achievable by conventional K-means. Outliers are detected as points whose assignment causes a significant increase in cluster variance as determined by a Chebyshev-type threshold, and are permanently removed from the clustering process. Empirical evaluations on synthetic and real-world datasets demonstrate significant reductions in intra-cluster variance, improvements in unsupervised and supervised clustering quality metrics, and robust identification of anomalies (Shorewala et al., 30 May 2025).
1. Mathematical Formulation and Objective
Let $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ be the input data with $n$ samples in $d$ dimensions, and let $C = \{c_1, \dots, c_k\}$ denote the set of $k$ cluster centroids. The algorithm employs the Euclidean ($\ell_2$) distance for all clustering assignments.
After assignment, let $S_j$ be the set of points belonging to cluster $j$ with $n_j = |S_j|$. The sample variance within each cluster is
$$\sigma_j^2 = \frac{1}{n_j} \sum_{x \in S_j} \lVert x - c_j \rVert_2^2.$$
The global average intra-cluster variance is
$$\bar{\sigma}^2 = \frac{1}{k} \sum_{j=1}^{k} \sigma_j^2.$$
The objective of VR-Kmeans is to iteratively minimize $\bar{\sigma}^2$ by alternating clustering, outlier removal, and centroid updates until stabilization.
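As a concrete reference for the quantities above, the mean intra-cluster variance can be computed directly from the definitions; this NumPy sketch (function name ours) takes the data, an assignment, and the centroids:

```python
import numpy as np

def mean_intra_cluster_variance(X, labels, centroids):
    """Average over clusters j of (1/n_j) * sum of squared L2
    distances from each member point to its centroid c_j."""
    variances = []
    for j, c in enumerate(centroids):
        members = X[labels == j]
        if len(members) == 0:
            continue  # empty clusters contribute no variance term
        variances.append(np.mean(np.sum((members - c) ** 2, axis=1)))
    return float(np.mean(variances))
```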
2. Algorithmic Procedure
The core VR-Kmeans procedure consists of repeated cycles that alternate the following steps:
- Initialization: Centroids are initialized using standard K-means.
- Assignment: Each $x_i \in X \setminus A$ is assigned to its closest centroid $c_j$, where $A$ is the set of removed anomalies.
- Variance Computation: For each cluster $j$, the array of distances $\lVert x_i - c_j \rVert_2$ from points in the cluster to centroid $c_j$ is computed. The mean $\mu_j$ and standard deviation $s_j$ of these distances are evaluated.
- Outlier Removal: Using the Chebyshev-type threshold $\mu_j + \lambda s_j$ for a fixed multiplier $\lambda$, all $x_i$ with $\lVert x_i - c_j \rVert_2 > \mu_j + \lambda s_j$ are permanently removed from the cluster and added to $A$.
- Centroid Update: For each cluster $j$, the centroid is updated as $c_j = \frac{1}{n_j} \sum_{x \in S_j} x$.
- Variance Update and Convergence: The new $\bar{\sigma}^2$ is computed. Iterations continue until $|\bar{\sigma}^2_{(t)} - \bar{\sigma}^2_{(t-1)}| < \epsilon$, a small pre-specified tolerance.
Upon termination, the returned outputs are the final centroids $C$ and the set of anomalies $A$.
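The cycle above can be sketched in NumPy. This is a minimal illustration, not the reference implementation: it uses a plain random initialization instead of the paper's K-means warm start, and `lam`, `tol`, and the small-spread guard are our illustrative choices rather than the paper's notation.

```python
import numpy as np

def vr_kmeans(X, k, init=None, lam=3.0, tol=1e-6, max_iter=100, seed=0):
    """Sketch of the VR-Kmeans loop: assignment, Chebyshev-style
    pruning, centroid update, repeated until the mean intra-cluster
    variance stabilizes."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = (X[rng.choice(len(X), k, replace=False)].copy()
                 if init is None else np.array(init, dtype=float))
    active = np.ones(len(X), dtype=bool)   # True = not pruned as anomaly
    prev_var = np.inf
    for _ in range(max_iter):
        Xa = X[active]
        # Assignment: nearest centroid under the Euclidean distance.
        d = np.linalg.norm(Xa[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        dist = d[np.arange(len(Xa)), labels]
        # Outlier removal: prune points whose distance exceeds
        # mean + lam * std of their cluster's distance distribution.
        keep = np.ones(len(Xa), dtype=bool)
        for j in range(k):
            in_j = labels == j
            if in_j.sum() < 2:
                continue
            mu, s = dist[in_j].mean(), dist[in_j].std()
            if s < 1e-12:                  # no spread -> nothing to prune
                continue
            keep &= ~(in_j & (dist > mu + lam * s))
        active_idx = np.flatnonzero(active)
        active[active_idx[~keep]] = False  # permanent removal
        Xa, labels = Xa[keep], labels[keep]
        # Centroid update and mean intra-cluster variance.
        var = 0.0
        for j in range(k):
            members = Xa[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
                var += np.mean(np.sum((members - centroids[j]) ** 2, axis=1))
        var /= k
        if abs(prev_var - var) < tol:      # convergence on variance change
            break
        prev_var = var
    return centroids, np.flatnonzero(~active)
```

On a toy dataset with two tight clusters and one distant point, the distant point lands in the anomaly set and the returned centroids match the cluster means of the remaining points.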
3. Hyper-parameters and Statistical Guarantees
VR-Kmeans uses the following hyper-parameters; only the outlier bound $\lambda$ is new relative to standard K-means:
- $k$: Number of clusters (identical to K-means).
- $\lambda$: Outlier removal bound, specifying the number of standard deviations above the mean distance beyond which points are considered anomalies. By Chebyshev’s inequality, at most $1/\lambda^2$ of the data in each cluster are removed per iteration, guaranteeing that at least $1 - 1/\lambda^2$ of each cluster's points are retained when $\lambda > 1$. Larger values of $\lambda$ reduce the aggressiveness of outlier pruning.
- $\epsilon$: Convergence threshold for the change in $\bar{\sigma}^2$.
- Distance Metric: Fixed to $\ell_2$ (Euclidean), though, in principle, any Minkowski-type distance may be used.
The use of Chebyshev’s inequality formalizes the robustness of the anomaly threshold, preventing excessive data removal during each iteration.
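The retention guarantee can be checked numerically: applying Markov's inequality to squared deviations shows that, for any sample, at most $1/\lambda^2$ of the points can lie more than $\lambda$ sample standard deviations from the sample mean. A quick sketch (the heavy-tailed test distribution is our choice):

```python
import numpy as np

rng = np.random.default_rng(42)
# Heavy-tailed stand-in for within-cluster distance arrays.
d = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
for lam in (1.5, 2.0, 3.0):
    pruned = np.mean(d > d.mean() + lam * d.std())
    # Chebyshev/Markov bound: at most 1/lam^2 of the sample is pruned.
    assert pruned <= 1.0 / lam**2
    print(f"lam={lam}: pruned fraction {pruned:.4f} <= {1 / lam**2:.4f}")
```

The bound holds deterministically for sample statistics (with the biased standard deviation, as NumPy's default `std` computes), not merely in expectation.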
4. Anomaly Detection Mechanism
Anomaly detection is inherent to the VR-Kmeans algorithm. After each assignment step, any $x_i$ such that $\lVert x_i - c_j \rVert_2 > \mu_j + \lambda s_j$ within its assigned cluster $j$ is flagged as an outlier. These points are immediately and permanently excluded from further clustering and are aggregated into the anomaly set $A$. This procedure detects both local (relative to cluster) and global outliers and does not require prior labeling. The anomaly score is implicitly determined by deviation from the cluster mean distance rather than any global distance criterion.
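The implicit score can be made explicit as a within-cluster distance z-score; this helper (name and interface ours, not from the paper) returns how many cluster standard deviations each point lies above its cluster's mean distance, so that points scoring above the pruning multiplier would be flagged:

```python
import numpy as np

def anomaly_scores(X, labels, centroids):
    """Within-cluster z-score of each point's distance to its
    assigned centroid; higher means more anomalous."""
    scores = np.zeros(len(X))
    for j, c in enumerate(centroids):
        in_j = labels == j
        if in_j.sum() < 2:
            continue  # too few points to estimate spread
        d = np.linalg.norm(X[in_j] - c, axis=1)
        s = d.std()
        if s > 0:
            scores[in_j] = (d - d.mean()) / s
    return scores
```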
5. Empirical Performance and Evaluation Metrics
Empirical results on both synthetic and real (UCI) datasets demonstrate sharp improvements in variance-reduction and clustering efficacy over standard K-means (Shorewala et al., 30 May 2025).
| Dataset | Outliers Removed | Variance Reduction | Davies-Bouldin Δ | Silhouette Δ | Calinski-Harabasz Δ | Accuracy Δ | F1 Score Δ | Jaccard Δ | V-measure Δ |
|---|---|---|---|---|---|---|---|---|---|
| Synthetic 2D | 7.5% | ↓18.7% | ↓13.9% | ↑9.44% | ↑6.95% | — | — | — | — |
| UCI WBC | 10.1% | ↓57.9% | ↓11.5% | ↑11.0% | ↑31.7% | ↑1.95% | ↑1.73% | ↑3.34% | ↑9.95% |
| UCI Wine Quality | 7.6% | ↓88.1% | ↓12.6% | ↑8.2% | ↑39.4% | ↑22.5% | ↑20.8% | ↑22.5% | ↑78.6% |
Key:
- Intrinsic metrics: variance, Davies-Bouldin, Silhouette, Calinski-Harabasz
- Extrinsic metrics: accuracy, F1, Jaccard, V-measure (where true labels permit comparison)
Across settings, VR-Kmeans achieves variance drops between 18.7% and 88.1%, with corresponding substantial improvements in cluster quality (Silhouette increase up to 11.0%; Calinski-Harabasz increase up to 39.4%; Davies–Bouldin reduced by up to 13.9%). When external labels are available, accuracy and F1 score improvements reach 22.5% and 20.8% respectively.
6. Convergence and Computational Complexity
Convergence of VR-Kmeans is ensured empirically by the monotonic reduction of $\bar{\sigma}^2$ until the incremental improvement falls below the threshold $\epsilon$, that is, $|\bar{\sigma}^2_{(t)} - \bar{\sigma}^2_{(t-1)}| < \epsilon$.
No formal proof of achieving the global minimum of average variance is provided. Empirically, stabilization is typically reached within a few dozen iterations.
Each iteration matches the computational cost per iteration of standard K-means:
- Assignment step: $O(nkd)$
- Centroid recomputation: $O(nd)$
- Mean/std computation for each cluster: $O(n)$ overall, since the distances are already available from the assignment step
Overall runtime is $O(Tnkd)$ for $T$ iterations (typically a few dozen). If intrinsic metrics such as the Silhouette coefficient are computed at each stage, additional overhead may result.
7. Comparative Advantages and Summary
VR-Kmeans achieves a reduction in intra-cluster variance beyond the Lloyd-style local optimum of vanilla K-means by systematic “pruning” of variance-inflating outliers. It maintains full compatibility with the K-means workflow but adds only the single hyper-parameter $\lambda$ to control anomaly sensitivity. The algorithm’s built-in anomaly detection returns an explicit set of outliers without recourse to external modules or priors. Improvements in both unsupervised metrics (variance, internal indices) and supervised ones (label agreement) are consistently observed on synthetic and real datasets. The approach is a direct extension of K-means, readily adaptable for robust clustering and integrated anomaly detection applications (Shorewala et al., 30 May 2025).