Crawling Facebook for Social Network Analysis Purposes (1105.6307v1)

Published 31 May 2011 in cs.SI, cs.CY, and physics.soc-ph

Abstract: We describe our work in the collection and analysis of massive data describing the connections between participants to online social networks. Alternative approaches to social network data collection are defined and evaluated in practice, against the popular Facebook Web site. Thanks to our ad-hoc, privacy-compliant crawlers, two large samples, comprising millions of connections, have been collected; the data is anonymous and organized as an undirected graph. We describe a set of tools that we developed to analyze specific properties of such social-network graphs, i.e., among others, degree distribution, centrality measures, scaling laws and distribution of friendship.

Summary

The paper investigates privacy-compliant Facebook crawling to measure metrics like degree distribution and centrality.
It compares BFS and Uniform sampling, showing BFS's tendency to over-represent high-degree nodes versus the unbiased Uniform approach.
Empirical results reveal key network properties, including power-law distributions, small-world phenomena, and variations in connected components.

The paper investigates methods for collecting and analyzing social network data from Facebook using privacy-compliant crawlers. The researchers collected large datasets to analyze structural properties of online social networks (OSNs), focusing on degree distribution, centrality measures, and scaling laws. These metrics help characterize the structure and dynamics of online communities and how they relate to real-life social structures.

The paper compares two sampling approaches, Breadth-First Search (BFS) sampling and Uniform sampling, in terms of the efficacy and accuracy of data collection from OSNs. BFS, a graph traversal algorithm, has been used extensively for OSN crawling because it visits nodes systematically. However, BFS tends to over-represent high-degree nodes, a bias that Uniform sampling aims to correct. The Uniform approach is based on rejection sampling: user IDs are drawn at random and IDs that do not correspond to an existing user are rejected, which yields an unbiased sample of users.
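The contrast between the two strategies can be illustrated on a toy graph. The sketch below is not the authors' crawler. The graph, the ID space of 200, and the names `build_toy_graph`, `bfs_sample`, `uniform_sample`, and `mean_degree` are all assumptions made for illustration. It shows a budget-limited BFS quickly reaching a hub and inflating the sample's mean degree, while rejection sampling over the ID space selects each existing user with equal probability.

```python
import random
from collections import deque


def build_toy_graph():
    """Hypothetical toy network: hub 0 has 50 friends, each with one extra leaf friend."""
    adj = {}

    def add_edge(u, v):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    for i in range(1, 51):
        add_edge(0, i)        # friendship with the hub
        add_edge(i, 100 + i)  # leaf friend; leaves get IDs 101..150
    return adj


def bfs_sample(adj, seed, budget):
    """Collect up to `budget` nodes in breadth-first order starting from `seed`."""
    sampled = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(sampled) < budget:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                sampled.append(nbr)
                queue.append(nbr)
                if len(sampled) == budget:
                    break
    return sampled


def uniform_sample(adj, id_space, budget, rng):
    """Rejection sampling: draw IDs uniformly, keep only IDs that exist.

    Sampling is with replacement; `adj` is assumed to be non-empty.
    """
    sampled = []
    while len(sampled) < budget:
        candidate = rng.randrange(id_space)
        if candidate in adj:  # reject IDs with no matching user
            sampled.append(candidate)
    return sampled


def mean_degree(adj, nodes):
    """Average number of friends over the given nodes."""
    return sum(len(adj[n]) for n in nodes) / len(nodes)


adj = build_toy_graph()
bfs_nodes = bfs_sample(adj, seed=1, budget=10)
uni_nodes = uniform_sample(adj, id_space=200, budget=10, rng=random.Random(0))
print("true mean degree:", round(mean_degree(adj, list(adj)), 2))
print("BFS sample mean degree:", mean_degree(adj, bfs_nodes))
print("Uniform sample mean degree:", mean_degree(adj, uni_nodes))
```

With only ten draws, the uniform estimate is noisy because it depends on whether the hub happens to be selected. Its lack of bias shows up only on average over many samples, whereas the BFS sample contains the hub whenever the seed is close to it.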

The paper's datasets, chosen for scalability and representativeness, yield several findings. In the BFS sample, 12.58 million unique edges connected 8.21 million nodes, with a reported average degree of 396.8. Effective diameter measurements and clustering coefficients in both samples are consistent with the small-world phenomenon, although the paper notes potential algorithmic biases in the BFS sample. Privacy settings on Facebook were a further obstacle: they limit which data can be accessed and reduce the sampling success rate.
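As a quick sanity check on these figures, the mean degree of an undirected graph with E edges and N nodes is 2E/N, since every edge contributes to the degree of two nodes. The sketch below applies that formula to the numbers quoted above; the function name `mean_degree` is an assumption for illustration.

```python
def mean_degree(num_edges, num_nodes):
    """Mean degree of an undirected graph: each edge is counted at both endpoints."""
    return 2 * num_edges / num_nodes


# Figures quoted for the BFS sample: 12.58 million edges, 8.21 million nodes.
print(round(mean_degree(12_580_000, 8_210_000), 2))
```

Taken over all 8.21 million nodes, this gives a mean degree of roughly 3, far below 396.8. The quoted average is therefore presumably computed over a smaller set of nodes, such as the users whose friend lists were actually crawled. That reading is an assumption, not something this summary states.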

The paper also examines the connected components of each sample: the BFS sample forms an almost fully connected graph, whereas the Uniform sample is somewhat more fragmented. This difference shows how the choice of sampling method shapes the picture of the network that emerges.
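One common way to quantify this kind of fragmentation is the share of nodes that fall in the largest connected component. The following is a minimal sketch, assuming the sample is stored as a symmetric adjacency dictionary in which every node appears as a key. The function names and the toy data are illustrative and not taken from the paper.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph as a list of node sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first flood fill from `start`
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    component.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components


def largest_component_fraction(adj):
    """Fraction of all nodes that belong to the largest connected component."""
    components = connected_components(adj)
    return max(len(c) for c in components) / len(adj)


# Toy sample: a path 1-2-3, a pair 4-5, and an isolated node 6.
toy = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(len(connected_components(toy)), largest_component_fraction(toy))
```

A value close to 1 corresponds to the nearly complete connectivity described for the BFS sample, while lower values indicate a more fragmented sample.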

Experimental results, obtained with the SNAP library, show degree distributions consistent with the power-law hypothesis common in topological characterizations of social networks. Hop plots, diameter metrics, and clustering coefficients add further detail to the picture of Facebook's social structure.
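Two of these measurements, the degree distribution and the clustering coefficient, are simple enough to sketch in plain Python. The code below does not use SNAP and is not the authors' tooling. It assumes the graph is a dictionary mapping each node to the set of its neighbors, and all function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(adj):
    """Map each degree k to the number of nodes that have exactly k neighbors."""
    return Counter(len(nbrs) for nbrs in adj.values())


def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to close
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(dict(degree_distribution(toy)))
print(round(average_clustering(toy), 3))
```

For a power-law check, the counts returned by `degree_distribution` would typically be plotted on log-log axes, where a power law appears as an approximately straight line.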

Looking forward, the work could serve as a template for designing efficient data mining frameworks for OSNs. With future scalability in mind, the paper suggests parallelizing data collection, which could speed up acquisition without compromising data integrity.

In conclusion, the paper contributes to social network analysis (SNA) by comparing data collection approaches and offering insights into the structure of Facebook's friendship graph. As SNA continues to evolve, these findings could support advances in computational social science, helping researchers map the relationships that characterize modern digital platforms more accurately. Future work on parallel processing and real-time adaptation to network changes could yield deeper insights into OSN dynamics, with both theoretical and practical applications in network science and data analytics.