A Hybrid Approach To Hierarchical Density-based Cluster Selection (1911.02282v4)

Published 6 Nov 2019 in cs.DB

Abstract: HDBSCAN is a density-based clustering algorithm that constructs a cluster hierarchy tree and then uses a specific stability measure to extract flat clusters from the tree. We show how the application of an additional threshold value can result in a combination of DBSCAN* and HDBSCAN clusters, and demonstrate potential benefits of this hybrid approach when clustering data of variable densities. In particular, our approach is useful in scenarios where we require a low minimum cluster size but want to avoid an abundance of micro-clusters in high-density regions. The method can directly be applied to HDBSCAN's tree of cluster candidates and does not require any modifications to the hierarchy itself. It can easily be integrated as an addition to existing HDBSCAN implementations.

Citations (4)

Summary

The paper presents HDBSCAN(ε̂), a cluster selection method that mitigates excessive micro-cluster formation in dense regions while preserving cluster granularity elsewhere.
It employs an adjustable distance threshold to balance cluster splitting and robustly merge small clusters within the HDBSCAN framework.
Experimental comparisons with OPTICS and standard HDBSCAN demonstrate its effectiveness in accurately clustering variable-density datasets.

Hybrid Approach to HDBSCAN for Clustering Data with Variable Densities

Introduction to the Hybrid Approach

The paper by Claudia Malzer and Marcus Baum introduces a cluster selection method for the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) framework. The method, denoted HDBSCAN(ε̂), combines elements of DBSCAN* and HDBSCAN to handle datasets with variable densities. It targets scenarios where a low minimum cluster size is desirable but the resulting abundance of micro-clusters in dense regions is not. The method operates directly on HDBSCAN's tree of cluster candidates, requires no modifications to the hierarchy itself, and can be added to existing HDBSCAN implementations.

Background and Related Work

DBSCAN's Limitations: The paper outlines the limitations of DBSCAN, particularly its reliance on a global density threshold, which restricts its ability to identify clusters of variable densities.
Evolution of DBSCAN: Various modifications and extensions to DBSCAN, such as OPTICS and density-ratio based clustering, have been proposed to address these limitations.
HDBSCAN Overview: HDBSCAN extends DBSCAN by allowing the identification of clusters across different density levels, using a hierarchy of potential clusters and a stability-based metric for cluster selection.
FOSC Framework: HDBSCAN's optimization problem is framed within the "Framework for Optimal Selection of Clusters" (FOSC), which allows for the formalization and efficient resolution of cluster selection.

The HDBSCAN Algorithm and Extensions

Mutual Reachability Distance: The paper describes HDBSCAN's use of mutual reachability distance to construct a more robust clustering hierarchy, resisting noise and allowing for the exploration of cluster candidates across varying densities.
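Condensed Cluster Hierarchy: HDBSCAN simplifies the full hierarchical tree by condensing it according to the minimum cluster size parameter: a split that produces a component smaller than this size is treated as points dropping out of the parent cluster rather than as the birth of a new cluster.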
Stability-Based Cluster Selection: The introduction of stability as a selection criterion facilitates the identification of significant clusters, optimizing for clusters that exhibit the highest stability across different density thresholds.
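
The mutual reachability distance mentioned above can be illustrated with a minimal sketch. It assumes Euclidean distance and defines a point's core distance as the distance to its k-th nearest neighbour, not counting the point itself; conventions differ between implementations on whether the point is counted. The function names and the parameter k are hypothetical and are not taken from the paper.

```python
import math

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (self excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def mutual_reachability(points, k):
    """Pairwise matrix of max(core(a), core(b), d(a, b)); the diagonal is 0."""
    core = core_distances(points, k)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(core[i], core[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Because the distance between two points is never smaller than either point's core distance, points in sparse regions are pushed away from everything else, which is what makes the resulting hierarchy less sensitive to noise.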

HDBSCAN(ε̂): Addressing the Micro-Cluster Challenge

Motivation: The paper identifies a critical challenge in handling datasets with highly variable densities—particularly, the propensity of HDBSCAN to generate excessive micro-clusters in dense regions when a low minimum cluster size is selected.
Hybrid Approach Mechanism: The proposed HDBSCAN(ε̂) method applies an additional distance threshold ε̂ during cluster selection. Clusters are not allowed to split below this distance, so dense regions yield DBSCAN*-like clusters at the threshold level while sparser regions keep their HDBSCAN clusters. This retains small clusters where they are meaningful and prevents micro-cluster proliferation where they are not.
FOSC Compliance: The selection criterion remains local and additive, as FOSC requires, so the optimization can still be solved efficiently and HDBSCAN(ε̂) can be integrated into existing HDBSCAN implementations.
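
The effect of the threshold can be sketched as a post-processing step on an already selected set of clusters. This is a simplified illustration under stated assumptions, not the authors' algorithm or any library's implementation: the tree is given as a hypothetical child-to-parent dictionary, `birth_eps` is a hypothetical mapping from each non-root cluster to the distance at which it split off from its parent, and the root is never returned.

```python
def select_with_threshold(selected, parent, birth_eps, eps_hat):
    """Replace clusters that split off below eps_hat by a suitable ancestor.

    selected  -- iterable of initially chosen cluster ids
    parent    -- dict: cluster id -> parent id (the root has no birth_eps entry)
    birth_eps -- dict: cluster id -> distance at which the cluster appeared
    eps_hat   -- distance threshold below which splits are ignored
    """
    merged = set()
    for cluster in selected:
        node = cluster
        # Climb while this cluster was created by a split below the threshold
        # and its parent is not the root.
        while birth_eps[node] < eps_hat and parent[node] in birth_eps:
            node = parent[node]
        merged.add(node)

    def has_selected_ancestor(node):
        # Walk up the tree and report whether any ancestor is already chosen.
        while node in parent:
            node = parent[node]
            if node in merged:
                return True
        return False

    # Keep the result flat: drop clusters nested inside another chosen cluster.
    return {c for c in merged if not has_selected_ancestor(c)}
```

In a tree where a dense region splits at small distances and a sparse region splits at larger ones, a threshold between the two merges only the dense region's fragments and leaves the sparse region's clusters untouched.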

Experimental Validation and Implications

Comparison of Clustering Algorithms: The paper compares HDBSCAN(eom), OPTICS, DBSCAN*, and the proposed HDBSCAN(ε̂) on both synthetic and real datasets, showing that the hybrid approach reduces micro-clusters while preserving cluster granularity in variable-density data.
Adjustable Distance Threshold: The experiments highlight the flexibility of HDBSCAN(ε̂)'s distance threshold, which lets the sensitivity of cluster splitting be tuned to the characteristics of the data or the needs of the application.
Practical Applications: The methodology shows promise in applications requiring detailed spatial analysis, such as GPS data clustering for ride-pooling systems or radar reflection analysis in autonomous driving datasets.

Concluding Remarks and Future Directions

The paper presents HDBSCAN(ε̂) as an extension to the HDBSCAN framework, particularly for applications involving spatial data with varying densities. Because it complies with FOSC and works on the existing hierarchy without modifying it, it can be added to existing HDBSCAN implementations with little effort. Future research could explore combining HDBSCAN(ε̂) with semi-supervised learning models.
