Densest Subgraph in Streaming and MapReduce (1201.6567v1)

Published 31 Jan 2012 in cs.DB

Abstract: The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in the streaming model. For any epsilon>0, our algorithms make O((log n)/log (1+epsilon)) passes over the input and find a subgraph whose density is guaranteed to be within a factor 2(1+epsilon) of the optimum. Our algorithms are also easily parallelizable and we illustrate this by realizing them in the MapReduce model. In addition we perform extensive experimental evaluation on massive real-world graphs showing the performance and scalability of our algorithms in practice.

Authors (3)

Bahman Bahmani (4 papers)
Ravi Kumar (146 papers)
Sergei Vassilvitskii (44 papers)

Citations (234)


Summary

The paper proposes efficient algorithms for finding the densest subgraph of a large-scale graph, designed for the streaming and MapReduce models.
These algorithms find a subgraph whose density is within a factor 2(1+ε) of the optimum and scale well, requiring only O(log_{1+ε} n) = O(log n / log(1+ε)) passes over the data.
The practical applications include detecting communities in social networks, identifying gene interactions in biology, and improving link spam detection.

An Analysis of "Densest Subgraph in Streaming and MapReduce"

The paper "Densest Subgraph in Streaming and MapReduce" by Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii addresses the problem of finding the densest subgraph of a large-scale graph. This problem is a basic primitive in data analysis, with applications spanning community detection, spam detection, and biological network analysis. The core contribution of the paper is a set of efficient algorithms that operate in both the streaming model and distributed computing frameworks such as MapReduce, and that can process graphs with billions of edges.

Key Contributions

The authors propose algorithms that return a subgraph whose density is within a factor 2(1+ε) of the optimum, for any ε > 0. These algorithms are designed to run efficiently under the constraints of the streaming and MapReduce models, addressing both limited memory and the need for parallel computation. They make O(log_{1+ε} n) = O(log n / log(1+ε)) passes over the input, where n is the number of nodes, so the number of passes grows only logarithmically with graph size.
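The summary above does not spell out the steps of the algorithm, so the following is only a minimal in-memory sketch of a threshold-based peeling procedure that is consistent with the stated guarantees. It assumes an undirected graph, density defined as edges divided by nodes, and the rule "in each pass, remove every node whose degree is at most 2(1+ε) times the current density". The function name `densest_subgraph_peeling` and its return shape are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def densest_subgraph_peeling(edges, eps=0.1):
    """Sketch of threshold peeling (assumed rule, not the authors' code).

    Returns (best_nodes, best_density, passes), where density of a node
    set S is (number of edges inside S) / |S|.
    """
    # Deduplicate undirected edges and drop self-loops.
    live_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = {v for e in live_edges for v in e}

    best_nodes, best_density = set(), 0.0
    passes = 0
    while nodes:
        passes += 1
        # One pass over the surviving edges yields the induced degrees.
        degree = defaultdict(int)
        for e in live_edges:
            for v in e:
                degree[v] += 1

        density = len(live_edges) / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density

        # Average degree is 2 * density, so at least one node is always
        # at or below this threshold and the loop terminates.
        threshold = 2 * (1 + eps) * density
        removed = {v for v in nodes if degree[v] <= threshold}
        nodes -= removed
        live_edges = {e for e in live_edges if not (e & removed)}

    return best_nodes, best_density, passes


if __name__ == "__main__":
    # A 4-clique on nodes 0..3 with a short tail 3-4-5 attached.
    demo = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(densest_subgraph_peeling(demo, eps=0.1))
```

In a true streaming setting only the per-node degree counters would be held in memory and the edge set would be re-read from the stream on each pass; the sketch keeps the edges in a Python set purely for brevity.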

Theoretical Implications

The densest subgraph problem, for both undirected and directed graphs, has previously been addressed with methods based on parametric flow and linear programming relaxation. These methods provide exact solutions but are impractical for very large graphs because of their computational cost. The paper builds on Charikar's combinatorial approximation approach, and the result is of theoretical interest in its own right: achieving a constant-factor approximation within a sublinear memory footprint in the streaming model is a non-trivial accomplishment. The space lower bounds established in the paper further highlight the efficiency of the proposed algorithms.
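The logarithmic pass bound can be motivated by a short counting argument. The derivation below is a sketch under an assumption not stated in this summary: that each pass removes from the current node set S every node whose induced degree is at most 2(1+ε)ρ(S), where ρ(S) = |E(S)|/|S| is the density and A(S) denotes the removed nodes.

```latex
\begin{align*}
2\,|E(S)| \;=\; \sum_{v \in S} \deg_S(v)
  \;&\ge\; \sum_{v \in S \setminus A(S)} \deg_S(v)
  \;>\; |S \setminus A(S)| \cdot 2(1+\epsilon)\,\rho(S) \\
\Longrightarrow\quad
|S \setminus A(S)| \;&<\; \frac{2\,|E(S)|}{2(1+\epsilon)\,\rho(S)}
  \;=\; \frac{|S|}{1+\epsilon}.
\end{align*}
```

Under that assumption the surviving node set shrinks by a factor of at least 1+ε per pass, so after k passes fewer than n/(1+ε)^k nodes remain, and the procedure ends within O(log_{1+ε} n) passes.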

Practical Applications

From a practical standpoint, the newly formulated algorithms are applicable to varied computational contexts. In community mining, for example, the effective identification of dense subgraphs enhances the understanding of community structures within social networks. In computational biology, the algorithms can identify dense gene interactions, offering insights into biological processes. Link spam detection also benefits, with dense subgraph detection serving as a feature to improve web search algorithms by recognizing link spam structures.

Experimental Evaluation

The authors conduct experimental evaluations on massive datasets, including social network graphs such as Flickr and Twitter. The results demonstrate that the algorithms not only scale well but also often achieve near-optimal results. This empirical evidence substantiates the theoretical approximation guarantees, reinforcing the algorithms' applicability to real-world graph datasets. Notably, the paper's findings suggest that parameters governing the trade-off between passes and approximation accuracy can be tuned to achieve desired computational efficiencies without significantly compromising on the quality of the solution.
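To make the trade-off between passes and approximation quality concrete, the following sketch tabulates the worst-case approximation factor 2(1+ε) next to the quantity log n / log(1+ε) from the pass bound. The helper name `pass_bound` is hypothetical, the constant hidden by the O-notation is ignored, and the numbers are bounds rather than measurements from the paper.

```python
import math


def pass_bound(n, eps):
    """Ceiling of log_{1+eps}(n): the pass bound with its hidden constant dropped."""
    return math.ceil(math.log(n) / math.log(1 + eps))


if __name__ == "__main__":
    n = 10**9  # illustrative node count, not a dataset from the paper
    for eps in (0.1, 0.5, 1.0, 2.0):
        factor = 2 * (1 + eps)
        print(f"eps={eps}: approximation factor {factor:.1f}, "
              f"pass bound {pass_bound(n, eps)}")
```

Larger ε gives fewer passes at the cost of a weaker worst-case guarantee, which is the tuning knob the experiments explore.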

Future Directions

Looking forward, the paper hints at several promising research paths. The algorithms could be adapted to other computing frameworks, which would test their robustness and scalability in environments and on datasets beyond those evaluated. There is also room to explore optimizations or heuristics that reduce computational overhead in practice. As graph datasets continue to grow in size and complexity, progress in these directions could yield even more efficient methods.

In summary, the paper advances dense subgraph detection by providing algorithms suited to modern computational frameworks. By combining theoretical guarantees with empirical validation, it gives researchers and practitioners a practical toolset for analyzing complex graph datasets at scale.
