Defining and Evaluating Network Communities based on Ground-truth (1205.6233v3)

Published 28 May 2012 in cs.SI and physics.soc-ph

Abstract: Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task mainly due to a plethora of definitions of a community, intractability of algorithms, issues with evaluation and the lack of a reliable gold-standard ground-truth. In this paper we study a set of 230 large real-world social, collaboration and information networks where nodes explicitly state their group memberships. For example, in social networks nodes explicitly join various interest based social groups. We use such groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology which allows us to compare and quantitatively evaluate how different structural definitions of network communities correspond to ground-truth communities. We choose 13 commonly used structural definitions of network communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad-participation-ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than hundred million nodes. The proposed method achieves 30% relative improvement over current local clustering methods.

Authors (2)

Jaewon Yang
Jure Leskovec

Citations (2,115)


Summary

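The paper finds that conductance and triad participation ratio (TPR) are the scoring functions that best identify ground-truth communities, and it proposes a parameter-free, seed-based community detection method that achieves a 30% relative improvement over existing local clustering methods.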
It rigorously evaluates 13 structural definitions across 230 datasets, employing four perturbation strategies to assess robustness and sensitivity.
The findings underscore the importance of using ground-truth communities to develop scalable, accurate community detection algorithms for real-world networks.

Defining and Evaluating Network Communities Based on Ground-Truth

In their paper, "Defining and Evaluating Network Communities Based on Ground-Truth," Jaewon Yang and Jure Leskovec address the challenge of identifying and evaluating network communities. The authors focus on leveraging ground-truth communities in 230 large-scale networks to assess various structural definitions and community detection methods quantitatively.

Conceptual Framework

The paper examines real-world networks where nodes explicitly state their group memberships, such as social networks where users join interest groups. These explicit memberships define the ground-truth communities, which serve as the benchmark for evaluating structural definitions of network communities. The authors select 13 commonly used structural definitions and compare their sensitivity, robustness, and performance in identifying the ground truth.

Methodological Approach

Ground-Truth Community Definition

Yang and Leskovec compile 230 datasets from different domains, including social, collaboration, and information networks. A few examples include:

Online social networks like LiveJournal, Orkut, and Friendster, where users explicitly join interest-based groups.
Amazon co-purchasing network, where products are grouped based on hierarchically nested categories.
DBLP collaboration network using publication venues as proxies for research communities.

Structural Definitions and Community Scoring Functions

The authors evaluate 13 scoring functions, which fall into four classes according to the aspect of connectivity each one measures:

Internal Connectivity - e.g., internal density, triangle participation ratio (TPR).
External Connectivity - e.g., expansion, cut ratio.
Combined Internal and External Connectivity - e.g., conductance, normalized cut.
Network Modularity - e.g., modularity score.
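
To make these classes concrete, here is a minimal Python sketch of five of the scoring functions named above, using the definitions commonly given for them. The helper names (`build_adj`, `community_scores`) and the adjacency-map representation are illustrative assumptions, not the authors' code. For a node set S with n_S nodes, m_S internal edges, and c_S boundary edges in a graph of n nodes:

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def community_scores(adj, S):
    """Score a candidate community S (hypothetical helper, simple graphs only)."""
    S = set(S)
    n, n_S = len(adj), len(S)
    # m_S: edges with both endpoints in S (each is seen twice, hence // 2).
    m_S = sum(1 for u in S for v in adj[u] if v in S) // 2
    # c_S: edges with exactly one endpoint in S.
    c_S = sum(1 for u in S for v in adj[u] if v not in S)

    def in_internal_triangle(u):
        # True if two of u's neighbours inside S are themselves connected.
        nbrs = adj[u] & S
        return any(w in adj[v] for v, w in combinations(nbrs, 2))

    possible_internal = n_S * (n_S - 1) / 2
    possible_boundary = n_S * (n - n_S)
    volume = 2 * m_S + c_S
    return {
        # Internal connectivity
        "internal_density": m_S / possible_internal if possible_internal else 0.0,
        "tpr": sum(in_internal_triangle(u) for u in S) / n_S if n_S else 0.0,
        # External connectivity
        "expansion": c_S / n_S if n_S else 0.0,
        "cut_ratio": c_S / possible_boundary if possible_boundary else 0.0,
        # Combined internal and external connectivity
        "conductance": c_S / volume if volume else 0.0,
    }


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
scores = community_scores(build_adj(edges), {0, 1, 2})
print(scores)
```

Lower conductance, expansion, and cut ratio indicate a better-separated set, while higher internal density and TPR indicate a more tightly knit one. Modularity is omitted here because it also needs a null-model term.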

Evaluation Metrics

The proposed evaluation framework includes several community goodness metrics:

Separability: Ratio of internal to external edges.
Density: Fraction of possible internal edges that actually appear.
Cohesiveness: A measure of internal conductance.
Clustering Coefficient: Fraction of pairs of a node's neighbors that are themselves connected.
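
As a rough illustration of three of these goodness metrics, the sketch below computes separability, density, and an average clustering coefficient for a node set. The function names are hypothetical, and measuring the clustering coefficient inside the subgraph induced by the community is an assumption made for this example. Cohesiveness is left out because it requires searching over internal cuts.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def goodness_metrics(adj, S):
    """Separability, density, and clustering coefficient of S (illustrative)."""
    S = set(S)
    n_S = len(S)
    m_S = sum(1 for u in S for v in adj[u] if v in S) // 2   # internal edges
    c_S = sum(1 for u in S for v in adj[u] if v not in S)    # boundary edges

    def local_clustering(u):
        # Assumption: only neighbours inside S are considered.
        nbrs = adj[u] & S
        k = len(nbrs)
        if k < 2:
            return 0.0
        linked = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
        return linked / (k * (k - 1) / 2)

    possible = n_S * (n_S - 1) / 2
    return {
        # Ratio of internal to external edges; unbounded if S has no boundary.
        "separability": m_S / c_S if c_S else float("inf"),
        # Fraction of possible internal edges that are present.
        "density": m_S / possible if possible else 0.0,
        # Mean local clustering coefficient over the members of S.
        "clustering": sum(local_clustering(u) for u in S) / n_S if n_S else 0.0,
    }


# Two triangles joined by a bridge; S is the first triangle plus node 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
print(goodness_metrics(build_adj(edges), {0, 1, 2, 3}))
```

Adding node 3 to the first triangle lowers all three values relative to the triangle alone, which is the kind of contrast these metrics are meant to expose.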

Experimental Findings

Correlations Among Scoring Functions

The analysis reveals that the 13 scoring functions cluster into four natural groups, suggesting some structural definitions are highly correlated. Notably, modularity stands out due to its negligible correlation with other scoring functions.

Performance on Ground-Truth Communities

The results indicate that conductance and TPR provide the highest fidelity in identifying ground-truth communities. Conductance excels in capturing well-separated communities, while TPR is more effective for detecting dense and cohesive structures.

Robustness and Sensitivity

Using four perturbation strategies (NodeSwap, Random, Expand, Shrink), the authors assess the scores' robustness. They find conductance and TPR to be the most robust and sensitive, as these scoring functions maintain low Z-scores under slight perturbation but exhibit significant changes when the perturbation increases.
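
The summary names the perturbation strategies without spelling them out, so the following is only a sketch of how two of them might work. It assumes that NodeSwap exchanges a member for a non-member across a boundary edge, that Random exchanges a member for an arbitrary non-member, and that the intensity p sets the number of swaps as a fraction of the community size. The function names and the `rng` argument are illustrative.

```python
import random


def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_swap(adj, S, p, rng):
    """Assumed NodeSwap: swap memberships across a random boundary edge."""
    S = set(S)
    steps = max(1, round(p * len(S)))
    for _ in range(steps):
        # Boundary edges (u, v) with u inside S and v outside; sorted so that
        # a seeded rng gives reproducible results.
        boundary = sorted((u, v) for u in S for v in adj[u] if v not in S)
        if not boundary:
            break
        u, v = rng.choice(boundary)
        S.remove(u)
        S.add(v)
    return S


def random_swap(adj, S, p, rng):
    """Assumed Random: replace a random member with a random non-member."""
    S = set(S)
    steps = max(1, round(p * len(S)))
    for _ in range(steps):
        outside = sorted(set(adj) - S)
        if not outside or not S:
            break
        u = rng.choice(sorted(S))
        v = rng.choice(outside)
        S.remove(u)
        S.add(v)
    return S


edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = build_adj(edges)
rng = random.Random(0)
print(node_swap(adj, {0, 1, 2}, p=1 / 3, rng=rng))
print(random_swap(adj, {0, 1, 2}, p=1 / 3, rng=rng))
```

Both perturbations keep the community size fixed, so any change in a scoring function after perturbation reflects a change in structure rather than in size. Scoring many perturbed copies at increasing p and comparing them with the unperturbed score is one way to probe the robustness and sensitivity described above.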

Community Detection from a Seed Node

Extending the local spectral clustering algorithm, Yang and Leskovec introduce a parameter-free community detection method that achieves significant improvements over existing approaches. Notably, it achieves a 30% relative improvement in F1-score over conventional methods when detecting communities from seed nodes.
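
The general shape of seed-based local clustering can be sketched as follows. This is not the authors' implementation: it assumes random-walk scores from a plain power-iteration personalized PageRank, a sweep over nodes ordered by degree-normalized score, and a cut at the first local minimum of conductance. The parameter values (`alpha`, `iters`) and all function names are illustrative choices, and the graph is assumed to have no isolated nodes.

```python
def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def personalized_pagerank(adj, seed, alpha=0.15, iters=50):
    """Power iteration for a random walk that restarts at `seed`."""
    r = {u: 0.0 for u in adj}
    r[seed] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        nxt[seed] += alpha                      # restart mass
        for u, mass in r.items():
            if mass and adj[u]:
                share = (1 - alpha) * mass / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
        r = nxt
    return r


def conductance(adj, S):
    """Boundary edges divided by the total degree (volume) of S."""
    vol = sum(len(adj[u]) for u in S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    return cut / vol if vol else 1.0


def detect_from_seed(adj, seed, alpha=0.15):
    """Sweep over degree-normalized scores; stop at the first local minimum."""
    r = personalized_pagerank(adj, seed, alpha)
    order = sorted((u for u in adj if r[u] > 0),
                   key=lambda u: r[u] / len(adj[u]), reverse=True)
    phis, S = [], set()
    for u in order:
        S.add(u)
        phis.append(conductance(adj, S))
    for k in range(1, len(phis) - 1):
        if phis[k] < phis[k - 1] and phis[k] <= phis[k + 1]:
            return set(order[:k + 1])
    if len(phis) <= 1:
        return set(order)
    # Fallback: best prefix, excluding the whole reachable set (conductance 0).
    best = min(range(len(phis) - 1), key=lambda k: phis[k])
    return set(order[:best + 1])


def f1_score(detected, truth):
    """Harmonic mean of precision and recall between two node sets."""
    overlap = len(detected & truth)
    if not overlap:
        return 0.0
    precision, recall = overlap / len(detected), overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


# Two triangles joined by a bridge; the seed sits in the first triangle.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = build_adj(edges)
found = detect_from_seed(adj, seed=0)
print(found, f1_score(found, {0, 1, 2}))
```

Stopping at the first local minimum, rather than at a global minimum or a fixed size, is what removes the need for a user-supplied size or threshold in this sketch. The F1 helper shows how a detected set would be compared against a ground-truth community.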

Implications and Future Directions

The paper's findings emphasize the importance of defining and evaluating community detection methods against ground-truth communities. By providing a robust and scalable evaluation framework, the research can pave the way for better community detection algorithms. Future research could explore new structural definitions tailored to specific network types or improve methods for community detection in overlapping and multilayer networks.

Conclusion

"Defining and Evaluating Network Communities Based on Ground-Truth" contributes significantly to the field of network science by systematically evaluating various structural definitions and community detection methods. Yang and Leskovec’s work underscores the necessity of rigorous, scalable, and data-driven evaluation methodologies to advance community detection techniques.
