Arabesque: A System for Distributed Graph Mining - Extended version (1510.04233v1)

Published 14 Oct 2015 in cs.DC

Abstract: Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics. In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.

Citations (228)


Summary

The paper presents Arabesque, the first distributed platform for graph mining, using a novel filter-process model and efficient data structures.
Arabesque scales to trillions of embeddings on frequent subgraph mining, motif counting, and clique finding tasks.
Arabesque is relevant for social networks and bioinformatics, enabling scalable analysis of complex graph patterns in large datasets.

Arabesque: A System for Distributed Graph Mining

The paper presents Arabesque, the first distributed data processing platform designed specifically for graph mining algorithms. The platform automates the exploration of a very large number of subgraphs, which is the costly core of mining graph patterns. Platforms such as MapReduce and Pregel are a poor match for these tasks because the number of candidate subgraphs, and with it the intermediate state, can grow explosively. Arabesque is built to handle that growth while keeping applications simple to write.

Overview

Arabesque employs a novel filter-process computational model: the platform explores subgraphs and hands each one to user-defined functions, which compute outputs and decide whether the subgraph should be extended further. This departs from the conventional "think like a vertex" paradigm used in graph computations in favor of a "think like an embedding" (TLE) paradigm, where an embedding is a subgraph instance found in the input graph. A graph mining application therefore consists mainly of a filter operation, which decides whether an embedding is relevant, and a process operation, which produces output for the embeddings that pass.
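The filter-process loop can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not Arabesque's API: the names `explore`, `filter_fn`, and `process_fn` are hypothetical, embeddings are plain vertex sets, and duplicates are removed with an in-memory set rather than any distributed mechanism. Clique finding serves as the example application.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[int, Set[int]]   # adjacency sets of an undirected graph
Embedding = FrozenSet[int]    # vertex-induced embedding (assumed representation)


def explore(graph: Graph,
            filter_fn: Callable[[Graph, Embedding], bool],
            process_fn: Callable[[Graph, Embedding], None],
            max_size: int) -> None:
    """Explore level by level: filter, process, then extend by one vertex."""
    frontier = {frozenset([v]) for v in graph}
    for size in range(1, max_size + 1):
        # Keep only the embeddings the application considers relevant.
        frontier = {e for e in frontier if filter_fn(graph, e)}
        for e in frontier:
            process_fn(graph, e)
        if size == max_size:
            break
        # Extend every surviving embedding by one neighboring vertex.
        extended = set()
        for e in frontier:
            neighbors = set().union(*(graph[v] for v in e)) - e
            extended.update(e | {u} for u in neighbors)
        frontier = extended


def is_clique(graph: Graph, emb: Embedding) -> bool:
    """Filter for clique finding: every pair of vertices must be adjacent."""
    return all(v in graph[u] for u, v in combinations(emb, 2))


# Toy graph: a triangle 0-1-2 with an extra edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques: List[Embedding] = []
explore(graph, is_clique, lambda g, e: cliques.append(e), max_size=3)
print(sorted(sorted(c) for c in cliques if len(c) == 3))  # [[0, 1, 2]]
```

Pruning on the filter is only safe when the filter is anti-monotonic, meaning that once an embedding fails, every extension of it fails too. That holds for cliques, which is why the sketch can discard rejected embeddings before extending.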

Arabesque's key innovations include a coordination-free exploration strategy based on embedding canonicality: among all the ways the same subgraph can be reached, only one canonical form is kept, so duplicates are discarded locally without communication between workers. The platform also implements a compressed data structure, the Overapproximating Directed Acyclic Graph (ODAG), which represents embeddings compactly and significantly reduces memory usage. This matters because the number of enumerated subgraphs can grow exponentially.
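The canonicality idea can be illustrated with a small incremental check for vertex-induced embeddings. This is a sketch in the spirit of the approach, not the platform's code: the function names and the exact rule are assumptions made for illustration. An embedding is treated as the ordered tuple of vertices in the order they were added, and an extension is accepted only if it preserves one designated ordering per vertex set.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]
Embedding = Tuple[int, ...]   # vertices in the order they were added


def is_canonical_extension(graph: Graph, parent: Embedding, v: int) -> bool:
    """Decide locally whether extending a canonical `parent` by `v` stays canonical.

    Assumed rule: the first vertex must be the smallest, and every vertex that
    comes after v's first neighbor in `parent` must have a smaller id than v.
    """
    if parent[0] > v:
        return False
    found_neighbor = False
    for u in parent:
        if not found_neighbor:
            if v in graph[u]:
                found_neighbor = True
        elif u > v:
            return False
    return True


def canonical_embeddings(graph: Graph, max_size: int) -> Dict[int, List[Embedding]]:
    """Enumerate embeddings level by level, keeping canonical extensions only."""
    level: List[Embedding] = [(v,) for v in sorted(graph)]
    by_size = {1: level}
    for size in range(2, max_size + 1):
        next_level: List[Embedding] = []
        for parent in level:
            candidates = set().union(*(graph[u] for u in parent)) - set(parent)
            for v in sorted(candidates):
                if is_canonical_extension(graph, parent, v):
                    next_level.append(parent + (v,))
        by_size[size] = next_level
        level = next_level
    return by_size


# Toy graph: a triangle 0-1-2 with an extra edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
result = canonical_embeddings(graph, 4)
print(result[3])  # [(0, 1, 2), (0, 2, 3), (1, 2, 3)]
```

Each connected vertex set appears exactly once in this toy run, and the check needs only the parent embedding, the candidate vertex, and the graph. No shared "already seen" table is consulted, which is what makes the strategy coordination-free.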

Results and Contributions

The paper addresses three fundamental graph mining problems with Arabesque: frequent subgraph mining (FSM), motif counting, and clique finding. Implemented on Arabesque's API, each requires only a handful of lines of code and scales to trillions of embeddings on large graphs across hundreds of cores. Some of these implementations are the first distributed solutions available for their problem.
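To make the motif counting task concrete, the sketch below counts connected induced subgraphs of a given size, grouped by pattern. It is a brute-force, single-machine illustration and not how Arabesque explores the graph. The pattern key is a simplifying assumption: the sorted degree sequence inside the induced subgraph, which distinguishes all connected patterns only for sizes up to 4. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Set[int]]


def is_connected(graph: Graph, vertices: Iterable[int]) -> bool:
    """Check that the subgraph induced by `vertices` is connected (DFS)."""
    vs = set(vertices)
    start = next(iter(vs))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in graph[u] & vs:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vs


def pattern_key(graph: Graph, vertices: Tuple[int, ...]) -> Tuple[int, ...]:
    """Sorted degree sequence within the induced subgraph (valid key for size <= 4)."""
    vs = set(vertices)
    return tuple(sorted(len(graph[v] & vs) for v in vertices))


def motif_counts(graph: Graph, k: int) -> Counter:
    """Count connected induced subgraphs of size k, grouped by pattern key."""
    counts: Counter = Counter()
    for vertices in combinations(sorted(graph), k):
        if is_connected(graph, vertices):
            counts[pattern_key(graph, vertices)] += 1
    return counts


# Toy graph: a triangle 0-1-2 with an extra edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
counts = motif_counts(graph, 3)
print(counts[(2, 2, 2)], counts[(1, 1, 2)])  # 1 triangle, 2 open wedges -> "1 2"
```

Trying every vertex combination is only feasible on tiny inputs. The point of a platform like Arabesque is to replace that enumeration with incremental, distributed exploration while the application supplies little more than the grouping logic.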

Evaluations indicate that Arabesque compares favorably even with optimized single-threaded implementations of centralized algorithms, while also providing the scalability needed in large distributed environments. Two techniques keep memory and computational overhead in check: two-level pattern aggregation and ODAG compression. Together they make the system practical for real-world graph mining tasks such as analyzing large social network datasets.
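The compression idea behind ODAGs can be sketched as follows. The sketch assumes that an ODAG keeps one level per embedding position and records, for each vertex at a position, which vertices follow it at the next position. Sharing prefixes and suffixes saves space, but walking the structure can yield paths that were never stored, hence "overapproximating", so extraction must re-check each path. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Set, Tuple

Embedding = Tuple[int, ...]


class Odag:
    """Toy overapproximating DAG for embeddings of one fixed size."""

    def __init__(self, size: int) -> None:
        self.size = size
        # levels[i][v] = vertices seen at position i + 1 directly after v at position i
        self.levels: List[Dict[int, Set[int]]] = [defaultdict(set) for _ in range(size)]

    def add(self, emb: Embedding) -> None:
        assert len(emb) == self.size
        for i, v in enumerate(emb):
            successors = self.levels[i][v]   # registers v at position i
            if i + 1 < self.size:
                successors.add(emb[i + 1])

    def paths(self) -> Iterator[Embedding]:
        """Yield every path through the levels, including spurious ones."""
        def walk(i: int, prefix: Embedding) -> Iterator[Embedding]:
            if i == self.size - 1:
                yield prefix
                return
            for nxt in sorted(self.levels[i][prefix[-1]]):
                yield from walk(i + 1, prefix + (nxt,))

        for first in sorted(self.levels[0]):
            yield from walk(0, (first,))

    def extract(self, is_valid: Callable[[Embedding], bool]) -> List[Embedding]:
        """Recover embeddings by filtering out spurious paths."""
        return [p for p in self.paths() if is_valid(p)]


stored = [(1, 2, 3), (4, 2, 5)]
odag = Odag(3)
for emb in stored:
    odag.add(emb)

print(list(odag.paths()))
# [(1, 2, 3), (1, 2, 5), (4, 2, 3), (4, 2, 5)]  two of these were never stored

# Stand-in validity check; a real system would re-run its canonicality and filter checks.
print(odag.extract(lambda e: e in set(stored)))  # [(1, 2, 3), (4, 2, 5)]
```

The trade-off is extra computation at extraction time in exchange for a much smaller representation. That is attractive when the stored embeddings vastly outnumber the distinct vertices at each position.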

Implications and Future Work

Arabesque's advances in scalable graph mining are particularly relevant for applications in social networks, bioinformatics, and semantic web analytics, where the patterns of interest are complex and the input graphs are large. The work bears on both the theoretical understanding of distributed graph mining and its practical deployment in data-intensive applications.

Looking forward, Arabesque sets the stage for future exploration into more complex graph mining problems and their solutions in a distributed context. Further refinement of its algorithms and data structures could lead to even greater efficiencies and the ability to handle increasingly massive datasets. Additionally, enhancements to the API to support newer graph mining paradigms could encourage broader adoption and application development across diverse domains.

In summary, Arabesque lowers the barrier to distributed graph mining: applications are written against a small filter-process API, while the platform takes care of subgraph exploration, duplicate elimination, and scaling.
