
Hierarchical Clustering: Objective Functions and Algorithms (1704.02147v1)

Published 7 Apr 2017 in cs.DS and cs.LG

Abstract: Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, Dasgupta framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a 'good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties. We take an axiomatic approach to defining 'good' objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of "admissible" objective functions (that includes Dasgupta's one) that have the property that when the input admits a 'natural' hierarchical clustering, it has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better algorithms. For similarity-based hierarchical clustering, Dasgupta showed that the divisive sparsest-cut approach achieves an $O(\log^{3/2} n)$-approximation. We give a refined analysis of the algorithm and show that it in fact achieves an $O(\sqrt{\log n})$-approx. (Charikar and Chatziafratis independently proved that it is a $O(\sqrt{\log n})$-approx.). This improves upon the LP-based $O(\log n)$-approx. of Roy and Pokutta. For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor 2 approx., and provide a simple and better algorithm that gives a factor 3/2 approx. Finally, we consider a 'beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function has desirable properties for these inputs and we provide a simple (1 + o(1))-approximation in this setting.

Citations (453)

Summary

  • The paper formalizes objective functions using an axiomatic approach that aligns hierarchical clustering with combinatorial optimization principles.
  • It analyzes the recursive φ-sparsest-cut algorithm, establishing a (27/4)·φ-approximation guarantee that outperforms traditional agglomerative methods on worst-case inputs.
  • The study validates its framework on hierarchical stochastic block models, showing that a simple algorithm achieves a (1 + o(1))-approximation on such structured inputs.

A Formal Evaluation of Hierarchical Clustering: Objectives and Algorithmic Insights

The paper studies hierarchical clustering with a focus on objective functions and the algorithms that optimize them. Historically, work on hierarchical clustering emphasized procedures rather than well-defined objective functions. Building on Dasgupta's 2016 formulation, this paper addresses that gap and establishes a formal framework that treats hierarchical clustering as a combinatorial optimization problem.

Objective Function Characterization

The authors take an axiomatic approach to defining objective functions for hierarchical clustering. They introduce the notion of admissible objective functions (a class that includes Dasgupta's cost function) with the property that, whenever a similarity or dissimilarity graph admits a natural ground-truth hierarchy, a tree realizing that hierarchy attains the optimal value. Admissibility is characterized by conditions such as symmetry with respect to the order of a node's children, monotone dependence on the cardinalities of the child clusters, and equal cost for every binary tree on clique-like (uniform-weight) inputs. These conditions yield a principled framework for evaluating hierarchical clusterings, aligning algorithmically produced trees with the underlying ground-truth structure.
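To make the objective concrete, the following is a minimal sketch of Dasgupta's similarity-based cost function, the canonical admissible objective: each edge (i, j) pays its similarity weight times the number of leaves under the lowest common ancestor of i and j in the tree. The nested-tuple tree representation and the dict-of-weights graph encoding are simplifying assumptions for illustration, not the paper's notation.

```python
# Sketch of Dasgupta's cost: cost(T) = sum over edges (i, j) of
# w(i, j) * |leaves(T[i v j])|, where T[i v j] is the subtree rooted at the
# lowest common ancestor of i and j.
# Tree encoding (assumption): a leaf is a point id, an internal node is a
# pair (left_subtree, right_subtree).

def leaves(tree):
    """Return the set of leaf ids under a node."""
    if not isinstance(tree, tuple):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def dasgupta_cost(tree, weights):
    """weights: dict mapping frozenset({i, j}) -> similarity w(i, j) >= 0."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    l_leaves, r_leaves = leaves(left), leaves(right)
    n_here = len(l_leaves) + len(r_leaves)
    # Edges separated at this node have their LCA here, so each contributes
    # w(i, j) * (number of leaves under this node).
    split_cost = sum(
        weights.get(frozenset({i, j}), 0.0)
        for i in l_leaves for j in r_leaves
    ) * n_here
    return split_cost + dasgupta_cost(left, weights) + dasgupta_cost(right, weights)

# Example: a 4-point clique with unit similarities.
w = {frozenset({i, j}): 1.0 for i in range(4) for j in range(i + 1, 4)}
print(dasgupta_cost(((0, 1), (2, 3)), w))   # 20.0
print(dasgupta_cost((((0, 1), 2), 3), w))   # 20.0
```

On a unit-weight clique every binary tree has the same cost (20 for both trees in the example), which is exactly the kind of behaviour the admissibility conditions are designed to capture.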

Algorithmic Insights and Performance

The authors examine the algorithmic consequences of these objectives, analyzing existing heuristics and introducing new algorithms. The recursive φ-sparsest-cut algorithm stands out: its approximation guarantee for Dasgupta's objective improves upon the earlier LP-based O(log n) approximation of Roy and Pokutta.
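A hedged sketch of the divisive strategy follows: recursively split the current cluster with a sparsest cut and recurse on the two sides. The brute-force cut search below stands in for the φ-approximate sparsest-cut subroutine assumed by the analysis; it is exponential in the cluster size and is for illustration only.

```python
# Sketch of the divisive recursive-sparsest-cut heuristic: split with a
# (near-)sparsest cut, then recurse on both sides.

from itertools import combinations

def cut_weight(weights, side_a, side_b):
    return sum(weights.get(frozenset({i, j}), 0.0)
               for i in side_a for j in side_b)

def sparsest_cut(points, weights):
    """Return the bipartition (A, B) minimising w(A, B) / (|A| * |B|).

    Brute force; a real implementation would use a phi-approximate
    sparsest-cut subroutine instead.
    """
    points = list(points)
    best, best_ratio = None, float("inf")
    for k in range(1, len(points) // 2 + 1):
        for side_a in combinations(points, k):
            side_b = [p for p in points if p not in side_a]
            ratio = cut_weight(weights, side_a, side_b) / (len(side_a) * len(side_b))
            if ratio < best_ratio:
                best, best_ratio = (list(side_a), side_b), ratio
    return best

def recursive_sparsest_cut(points, weights):
    """Build a binary hierarchy (nested tuples) top-down."""
    if len(points) == 1:
        return points[0]
    side_a, side_b = sparsest_cut(points, weights)
    return (recursive_sparsest_cut(side_a, weights),
            recursive_sparsest_cut(side_b, weights))
```

Because it produces the same nested-tuple representation used above, its output can be scored directly with `dasgupta_cost`.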

For similarity graphs, the authors show that the recursive φ-sparsest-cut algorithm gives a (27/4)·φ-approximation, where φ is the approximation factor of the sparsest-cut subroutine; with the best known subroutine this yields the O(√log n) bound stated in the abstract, and the guarantee holds even on worst-case inputs. In contrast, agglomerative methods traditionally used in practice, such as average-linkage and single-linkage, can perform poorly on worst-case inputs for this objective. For the dissimilarity-based objective, however, the classic average-linkage algorithm achieves a factor-2 approximation, and the authors give a simple algorithm that improves this to 3/2. These worst-case results also set the stage for the beyond-worst-case analysis that follows.
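For reference, here is a minimal sketch of the average-linkage heuristic analysed for the dissimilarity objective: repeatedly merge the pair of clusters with the smallest average pairwise dissimilarity. The dict-of-weights encoding and the brute-force pair search are illustrative choices, not the paper's presentation.

```python
# Sketch of average-linkage agglomerative clustering on a dissimilarity graph.

def average_linkage(points, dissim):
    """dissim: dict frozenset({i, j}) -> dissimilarity. Returns nested tuples."""
    clusters = {i: p for i, p in enumerate(points)}   # cluster id -> subtree
    members = {i: {p} for i, p in enumerate(points)}  # cluster id -> leaf set
    next_id = len(points)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average dissimilarity.
        best_pair, best_avg = None, float("inf")
        ids = list(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                total = sum(dissim.get(frozenset({u, v}), 0.0)
                            for u in members[a] for v in members[b])
                avg = total / (len(members[a]) * len(members[b]))
                if avg < best_avg:
                    best_pair, best_avg = (a, b), avg
        a, b = best_pair
        clusters[next_id] = (clusters.pop(a), clusters.pop(b))
        members[next_id] = members.pop(a) | members.pop(b)
        next_id += 1
    return next(iter(clusters.values()))

# Example: {0, 1} and {2, 3} are mutually close; the natural hierarchy is recovered.
d = {frozenset({0, 1}): 1.0, frozenset({2, 3}): 1.0,
     frozenset({0, 2}): 5.0, frozenset({0, 3}): 5.0,
     frozenset({1, 2}): 5.0, frozenset({1, 3}): 5.0}
print(average_linkage([0, 1, 2, 3], d))   # ((0, 1), (2, 3))
```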

Handling Random and Structured Inputs

The beyond-worst-case analysis is carried out in a hierarchical stochastic block model (HSBM), a generalisation of the stochastic block model in which edge probabilities are governed by an underlying ground-truth hierarchy. The authors show that Dasgupta's cost function behaves well on such inputs and give a simple algorithm achieving a (1 + o(1))-approximation in this regime, bridging the theoretical framework with the structured inputs encountered in practice.
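To illustrate the flavour of such inputs, below is a toy sampler for a two-level hierarchical block model: each pair of points is connected with a probability determined by how deep their blocks' common ancestor sits in a ground-truth tree. The two-level structure, the function name, and the specific probabilities are assumptions for illustration, not parameters from the paper.

```python
# Toy two-level hierarchical stochastic block model (illustrative only).

import random

def sample_two_level_hsbm(block_sizes, p_within=0.9, p_sibling=0.5, p_far=0.1,
                          sibling_pairs=None, seed=0):
    """block_sizes: list of block sizes.
    sibling_pairs: set of frozensets of block indices sharing a deeper
    ancestor in the ground-truth tree (deeper ancestor = higher probability)."""
    rng = random.Random(seed)
    sibling_pairs = sibling_pairs or set()
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                p = p_within
            elif frozenset({labels[i], labels[j]}) in sibling_pairs:
                p = p_sibling
            else:
                p = p_far
            if rng.random() < p:
                edges[frozenset({i, j})] = 1.0
    return labels, edges

# Four blocks of 5 points; blocks (0, 1) and (2, 3) are siblings in the hierarchy.
labels, similarities = sample_two_level_hsbm(
    [5, 5, 5, 5], sibling_pairs={frozenset({0, 1}), frozenset({2, 3})})
```

Since the sampled `similarities` dictionary uses the same encoding as the sketches above, it can be fed directly to `dasgupta_cost` or `recursive_sparsest_cut` to see how they behave on structured inputs.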

Implications and Future Directions

The formalization of objective functions in hierarchical clustering addresses a long-standing gap in the clustering literature. It sets a precedent for aligning clustering algorithms with explicit objectives, so that hierarchical trees reflect genuine underlying data structure. The simple algorithms analyzed and proposed here are also practical for the large datasets common in contemporary machine learning. Future work might extend these approaches to more complex real-world scenarios and further refine algorithmic efficiency.

In conclusion, by pairing well-defined objective functions with rigorous algorithmic analysis, the paper both solidifies understanding of hierarchical clustering and opens avenues for future work in data analysis. Its axiomatic basis for evaluating clusterings stands to benefit applications in fields such as big-data analytics and bioinformatics.