Radius-Guided Post-Clustering for Shape-Aware, Scalable Refinement of k-Means Results (2504.20293v1)

Published 28 Apr 2025 in cs.LG

Abstract: Traditional k-means clustering underperforms on non-convex shapes and requires the number of clusters k to be specified in advance. We propose a simple geometric enhancement: after standard k-means, each cluster center is assigned a radius (the distance to its farthest assigned point), and clusters whose radii overlap are merged. This post-processing step loosens the requirement for exact k: as long as k is overestimated (but not excessively), the method can often reconstruct non-convex shapes through meaningful merges. We also show that this approach supports recursive partitioning: clustering can be performed independently on tiled regions of the feature space, then globally merged, making the method scalable and suitable for distributed systems. Implemented as a lightweight post-processing step atop scikit-learn's k-means, the algorithm performs well on benchmark datasets, achieving high accuracy with minimal additional computation.

Summary

Radius-Guided Post-Clustering for Improved k-Means Results

The paper "Radius-Guided Post-Clustering for Shape-Aware, Scalable Refinement of k-Means Results" addresses notable limitations of the traditional k-means clustering algorithm and introduces a post-clustering enhancement based on geometric principles. This modification seeks to refine clustering results, especially in datasets with non-convex shapes where typical k-means methods often underperform. The authors propose post-processing the k-means results by assigning each cluster center a radius and merging clusters whose radii overlap.

K-Means Clustering Limitations

K-means clustering is widely used for its speed and simplicity, but it has two inherent limitations: the number of clusters must be specified in advance, and it handles non-convex, non-linearly separable clusters poorly. Alternatives such as DBSCAN and hierarchical clustering can recover arbitrarily shaped clusters, but they often cost more computationally or require tuning sensitive parameters, which can compromise interpretability.

Proposal of Radius-Based Post-Clustering

The paper introduces a radius-based post-processing technique that relaxes the need to fix k exactly while accommodating complex cluster geometries. After running standard k-means, the authors compute a radius for each cluster center, defined as the distance to its farthest assigned data point. Clusters whose radii overlap are merged, which corrects over-segmentation and reconstructs non-convex cluster shapes. The method is demonstrated on synthetic datasets and the FCPS benchmark suite, where clustering success rates of 98-100% were achieved in various scenarios.
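The radius-and-merge step described above can be sketched in a few lines on top of scikit-learn. This is a minimal illustration, not the authors' implementation: the function name radius_merge is hypothetical, and the overlap test d(c_i, c_j) <= r_i + r_j (two balls intersect) is an assumption, since this summary does not spell out the exact criterion. Merging is taken to be transitive, so chains of overlapping clusters collapse into one group.

```python
import numpy as np
from sklearn.cluster import KMeans


def radius_merge(X, k, random_state=0):
    """Run k-means with an overestimated k, then merge overlapping clusters.

    Hypothetical helper; returns (merged_labels, radii).
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    centers, labels = km.cluster_centers_, km.labels_

    # Radius of a cluster: distance from its center to its farthest assigned point.
    dist_to_center = np.linalg.norm(X - centers[labels], axis=1)
    radii = np.zeros(k)
    np.maximum.at(radii, labels, dist_to_center)

    # ASSUMED overlap test: balls i and j intersect when d(c_i, c_j) <= r_i + r_j.
    center_dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    overlap = center_dist <= radii[:, None] + radii[None, :]

    # Union-find over the overlap graph, so merging is transitive.
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            if overlap[i, j]:
                parent[find(i)] = find(j)

    # Relabel the merged groups as 0..m-1 and map every point to its group.
    roots = np.array([find(i) for i in range(k)])
    _, group_of_cluster = np.unique(roots, return_inverse=True)
    return group_of_cluster[labels], radii


if __name__ == "__main__":
    from sklearn.datasets import make_circles

    # Two concentric rings: a non-convex case that plain k-means cannot separate.
    X, _ = make_circles(n_samples=1000, factor=0.3, noise=0.03, random_state=0)
    merged, _ = radius_merge(X, k=20)
    print("clusters after merging:", len(np.unique(merged)))
```

Because the radius is set by the farthest point, neighbouring k-means cells inside one dense region tend to overlap, so the merge is only informative when true clusters are separated by gaps that are large relative to the radii. That is consistent with the requirement that k be overestimated: a larger k yields smaller radii.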

Scalability Through Recursive Partitioning

A further advantage of radius-guided post-processing is scalability. The feature space can be partitioned into smaller tiled regions, with clustering run locally and independently in each tile, which suits distributed systems and large datasets. Comparing radii across tiles in a global merge step recovers cluster fragments that were split by the artificial tile boundaries. According to the paper, this lets the clustering process scale linearly, handling large volumes of data while preserving the clustering structure across tile boundaries.
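The tile-then-merge idea can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's procedure: tiles are equal-width slabs along the first feature only, every tile uses the same k_per_tile, and the global merge reuses an assumed ball-overlap test d(c_i, c_j) <= r_i + r_j. All function names (local_balls, merge_balls, tiled_radius_clustering) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def local_balls(X_tile, k, seed=0):
    """k-means inside one tile; returns centers, farthest-point radii, labels."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tile)
    d = np.linalg.norm(X_tile - km.cluster_centers_[km.labels_], axis=1)
    radii = np.zeros(k)
    np.maximum.at(radii, km.labels_, d)
    return km.cluster_centers_, radii, km.labels_


def merge_balls(centers, radii):
    """Group balls by connected components of the (assumed) overlap graph."""
    n = len(centers)
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    overlap = dist <= radii[:, None] + radii[None, :]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in zip(*np.nonzero(np.triu(overlap, 1))):
        parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(n)])
    return np.unique(roots, return_inverse=True)[1]


def tiled_radius_clustering(X, n_tiles=4, k_per_tile=4):
    """Cluster each slab independently, then merge all balls globally."""
    # ASSUMPTION: equal-width slabs along feature 0 stand in for the tiling.
    edges = np.linspace(X[:, 0].min(), X[:, 0].max(), n_tiles + 1)
    tile_of = np.searchsorted(edges, X[:, 0], side="right") - 1
    tile_of = np.clip(tile_of, 0, n_tiles - 1)

    centers, radii = [], []
    ball_of_point = np.empty(len(X), dtype=int)
    offset = 0
    for t in range(n_tiles):
        idx = np.nonzero(tile_of == t)[0]
        if len(idx) == 0:
            continue  # empty tile contributes no balls
        k = min(k_per_tile, len(idx))
        c, r, lab = local_balls(X[idx], k)
        centers.append(c)
        radii.append(r)
        ball_of_point[idx] = lab + offset  # give each ball a global index
        offset += k

    # Only centers and radii are needed for the global step, not the raw points.
    group_of_ball = merge_balls(np.vstack(centers), np.concatenate(radii))
    return group_of_ball[ball_of_point]
```

In a distributed setting, each call to local_balls could run on a separate worker, and only the small arrays of centers and radii would need to be gathered for merge_balls. The pairwise distance matrix in this sketch is quadratic in the total number of balls; a spatial index would be needed to keep the global step cheap at scale.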

Experimental Evaluation

The radius-based merging method recovered ground-truth clusters well across the FCPS datasets. Results remained stable even when the initial k was overestimated, although the abstract notes that the overestimate should not be excessive. The method retains a simple geometric intuition and adds little computation, so it can serve as a drop-in refinement for existing k-means implementations.

Implications and Future Directions

The paper's findings have practical implications for refining clustering methods, particularly in big data scenarios where scalability and shape recovery are pivotal. Future work could focus on optimizing the radius definition and developing adaptive heuristics for cluster merging. Addressing the method’s limitations in high-noise environments and its applicability to high-dimensional spaces could further enhance its utility, and extending these principles into explicit noise-handling frameworks might yield additional improvements.

The radius-guided post-clustering method presents an elegant solution to pervasive issues in traditional k-means clustering, enhancing its adaptability to various cluster shapes and relaxing dependence on exact k specifications. Its empirical success in diverse datasets alongside its ease of integration positions it as a valuable tool for researchers in the field pursuing efficient and scalable clustering solutions.