Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale (1605.05797v4)

Published 19 May 2016 in cs.SI and physics.soc-ph

Abstract: Notions of community quality underlie network clustering. While studies surrounding network clustering are increasingly common, a precise understanding of the relationship between different cluster quality metrics is lacking. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely-used network clustering algorithms -- Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes. We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on information recovery metrics. Our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information. Smart local moving is the best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it absolutely superior. Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters.

Summary

The paper "Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale" examines four widely used network clustering algorithms alongside six cluster quality metrics. It focuses on how stand-alone quality metrics, which score a clustering using only the graph itself, relate to information recovery metrics, which compare a clustering against known ground-truth communities, in large-scale settings.

The researchers have structured their investigation around four established algorithms: Louvain, Infomap, label propagation, and smart local moving (SLM). They assess these algorithms using both synthetic benchmark graphs generated following the LFR model and empirical datasets from social networking and academic co-authorship contexts. The synthetic graphs scale from 1,000 to 1,000,000 nodes, providing a robust test bed for measuring algorithm performance.
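Of the four algorithms, label propagation is the simplest to sketch. The block below is a minimal illustration, not the implementation evaluated in the paper: each node starts with its own label and repeatedly adopts the most common label among its neighbours. As a labeled simplification, it visits nodes in sorted order and breaks ties by smallest label, whereas the usual formulation randomizes both. The function names and the toy graph are hypothetical.

```python
from collections import Counter


def label_propagation(adjacency):
    """Deterministic label propagation on an undirected graph.

    adjacency maps each node to the set of its neighbours.
    Returns a dict mapping each node to its final label.
    """
    labels = {node: node for node in adjacency}  # every node starts alone
    changed = True
    while changed:
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[nbr] for nbr in adjacency[node])
            if not counts:
                continue  # an isolated node keeps its own label
            best = max(counts.values())
            # Keep the current label when it is already among the most common.
            if counts.get(labels[node], 0) == best:
                continue
            # Simplification: break ties by smallest label, not at random.
            labels[node] = min(l for l, c in counts.items() if c == best)
            changed = True
    return labels


def clique(nodes):
    """Adjacency dict for a complete graph on the given nodes."""
    return {u: {v for v in nodes if v != u} for u in nodes}


# Toy graph (hypothetical): two disconnected 4-cliques.
graph = {**clique([0, 1, 2, 3]), **clique([4, 5, 6, 7])}
labels = label_propagation(graph)

communities = {}
for node, label in labels.items():
    communities.setdefault(label, set()).add(node)
found = sorted(sorted(c) for c in communities.values())
print(found)
```

Because a node changes label only when the new label is strictly more common among its neighbours than its current one, every change increases the number of edges whose endpoints agree, so this variant always terminates.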

Among the stand-alone metrics scrutinized are modularity, conductance, and coverage, whereas adjusted Rand score, normalized mutual information, and Lancichinetti's normalized mutual information variant serve as the chosen information recovery metrics.
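To make the three stand-alone metrics concrete, here is a minimal sketch that computes them for a partition of a small unweighted, undirected graph. It assumes the common textbook definitions (modularity as the sum over clusters of the internal edge fraction minus the squared volume fraction, coverage as the fraction of edges inside clusters, conductance as cut edges divided by the smaller of the cluster volume and its complement); the paper's exact formulations or aggregation across clusters may differ. The function name and toy graph are hypothetical.

```python
def standalone_metrics(edges, clusters):
    """Return (modularity, coverage, per-cluster conductance).

    edges    : list of (u, v) pairs, undirected, no duplicates
    clusters : list of node sets forming a partition of the nodes
    """
    m = len(edges)
    membership = {node: i for i, c in enumerate(clusters) for node in c}

    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    internal = [0] * len(clusters)  # edges with both ends in the cluster
    cut = [0] * len(clusters)       # edges with exactly one end in it
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            internal[cu] += 1
        else:
            cut[cu] += 1
            cut[cv] += 1

    # Volume of a cluster = sum of the degrees of its nodes.
    volume = [sum(degree.get(n, 0) for n in c) for c in clusters]

    modularity = sum(
        internal[i] / m - (volume[i] / (2 * m)) ** 2
        for i in range(len(clusters))
    )
    coverage = sum(internal) / m
    conductance = []
    for i in range(len(clusters)):
        denom = min(volume[i], 2 * m - volume[i])
        conductance.append(cut[i] / denom if denom else 0.0)
    return modularity, coverage, conductance


# Toy graph (hypothetical): two triangles joined by the bridge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
clusters = [{0, 1, 2}, {3, 4, 5}]
q, cov, cond = standalone_metrics(edges, clusters)
print(q, cov, cond)
```

For this toy partition the values work out to modularity 5/14, coverage 6/7, and conductance 1/7 for each cluster. Note that lower conductance indicates a better-separated cluster, while higher modularity and coverage indicate a better partition.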

The paper's central observation is that the metrics disagree with one another, which undermines the practice of judging an algorithm by a single metric. Most notably, a clustering can score 0.4 out of 1 on modularity yet 0 out of 1 on information recovery, so modularity alone does not indicate how well embedded communities are recovered. Conductance, though imperfect, is the stand-alone metric that best tracks performance on the information recovery metrics.
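The information recovery side of this comparison can be sketched in a few lines. The block below computes normalized mutual information and the adjusted Rand score between a ground-truth labeling and a predicted one. It is an illustration under stated assumptions: NMI is normalized by the arithmetic mean of the two entropies, which is one of several conventions in use, and the variant of NMI discussed in the paper is not implemented here. Function names and example labelings are hypothetical.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    t_counts, p_counts = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (t_counts[t] * p_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = entropy(truth) + entropy(pred)
    # Convention: two single-cluster labelings are treated as identical.
    return 1.0 if denom == 0 else 2 * mi / denom


def adjusted_rand(truth, pred):
    """Adjusted Rand score; assumes at least two labeled items."""
    n = len(truth)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    max_index = (pairs_truth + pairs_pred) / 2
    if max_index == expected:
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different label names
one_cluster = [0, 0, 0, 0, 0, 0]  # everything merged into one cluster

print(nmi(truth, relabeled), adjusted_rand(truth, relabeled))
print(nmi(truth, one_cluster), adjusted_rand(truth, one_cluster))
```

Both scores depend only on the partition, not on the label names, so the relabeled copy scores 1 on each. The degenerate single-cluster labeling scores 0 on both, which shows how a clustering can recover no ground-truth information at all.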

The analysis finds smart local moving to be the best-performing algorithm overall, although the disagreement among metrics prevents the authors from declaring it absolutely superior. Louvain outperformed Infomap in nearly all tests, contradicting earlier work in which Infomap was superior to Louvain; this result is attributed to community size distributions, limits inherent in modularity, and field-of-view constraints.

Label propagation's performance varies widely with how clearly the clusters are defined, as controlled by the $\mu$ parameter in the LFR model: it performs poorly when clusters are less clearly defined, but scales efficiently and accurately to large graphs with well-defined clusters. This sensitivity underscores that the best choice of algorithm depends on the structure inherent in the network data.
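In the LFR model, $\mu$ is the mixing parameter: the fraction of each node's edges that connect to nodes outside its own community, so a small $\mu$ means well-defined clusters. As a minimal sketch, the empirical counterpart can be measured on any graph with known communities by averaging that fraction over nodes. The function name and toy graph below are hypothetical and are not taken from the paper.

```python
def mixing_parameter(edges, membership):
    """Average, over nodes, of the fraction of edges leaving the node's community.

    edges      : list of (u, v) pairs, undirected
    membership : dict mapping each node to its community id
    """
    degree, external = {}, {}
    for u, v in edges:
        # Count the edge once from each endpoint's point of view.
        for a, b in ((u, v), (v, u)):
            degree[a] = degree.get(a, 0) + 1
            if membership[a] != membership[b]:
                external[a] = external.get(a, 0) + 1
    # Nodes with no edges do not appear in `degree` and are skipped.
    return sum(external.get(n, 0) / d for n, d in degree.items()) / len(degree)


# Toy graph (hypothetical): two triangles joined by the bridge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
mu = mixing_parameter(edges, membership)
print(mu)
```

Here only nodes 2 and 3 have an edge leaving their community (one of three each), so the average is 1/9, a low value consistent with clearly separated clusters.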

From a practical standpoint, the paper urges caution when selecting clustering algorithms for large-scale networks and advocates testing effectiveness in the target domain, because the discrepancies among metrics remain unresolved. On the theoretical side, it motivates further research toward a unified understanding of community structure that reconciles intuitive notions of a community with measurable quality metrics.

Future work in network clustering could focus on redefining community metrics to better capture domain-specific nuances and on exploring alternative models that avoid the synthetic assumptions built into benchmarks such as LFR. Clearer and more predictable measures of clustering quality could ultimately improve applications in the many fields that rely on network analysis.
