Efficient Subgraph Matching on Billion Node Graphs (1205.6691v1)

Published 30 May 2012 in cs.DB

Abstract: The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.

Citations (396)

Summary

  • The paper introduces a novel algorithm that performs subgraph matching efficiently on billion-node graphs by avoiding heavy indexing.
  • The methodology leverages query decomposition into STwigs and distributed parallel processing to address scalability challenges.
  • Experimental results show linear scalability and fast response times, making the approach well suited to real-time graph analytics across various domains.

Efficient Subgraph Matching on Billion Node Graphs

The paper "Efficient Subgraph Matching on Billion Node Graphs" addresses a fundamental challenge in graph data processing: performing subgraph matching on extremely large-scale graphs, specifically those with billions of nodes. The authors propose a novel algorithm designed to overcome the limitations of existing approaches that often rely on super-linear space or time complexities, which become infeasible at such a large scale.

Subgraph Matching Challenge

Subgraph matching is a critical operation in many domains, requiring the identification of all subgraphs within a data graph G that are isomorphic to a query graph Q. This task becomes particularly challenging on billion-node graphs due to issues such as the vast size of the data, the need for effective query processing, and the lack of locality in graph data access.
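The matching semantics can be made concrete with a small sketch. The helper below is illustrative (not from the paper): it checks whether a candidate mapping from query nodes to data nodes is a valid subgraph-isomorphism match, i.e. the mapping is injective, labels agree, and every query edge maps to a data edge. Graphs are assumed to be adjacency-list dicts.

```python
def is_subgraph_match(query_adj, query_labels, data_adj, data_labels, mapping):
    """query_adj/data_adj: dict node -> set of neighbor nodes.
    query_labels/data_labels: dict node -> label.
    mapping: dict query node -> data node."""
    # Subgraph isomorphism requires an injective mapping (no two query
    # nodes may map to the same data node).
    if len(set(mapping.values())) != len(mapping):
        return False
    # Labels must be preserved by the mapping.
    for q, d in mapping.items():
        if query_labels[q] != data_labels[d]:
            return False
    # Every query edge must correspond to an edge in the data graph.
    for q, neighbors in query_adj.items():
        for q2 in neighbors:
            if mapping[q2] not in data_adj[mapping[q]]:
                return False
    return True
```

At billion-node scale the difficulty is not this check itself but enumerating candidate mappings without materializing huge intermediate results, which is what the paper's exploration-based algorithm addresses.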

Proposed Solution

The authors introduce a scalable algorithm leveraging distributed memory stores and parallel computing, avoiding the reliance on traditional index-based methods that incur prohibitive costs in space and construction time. Instead, their approach utilizes efficient graph exploration techniques facilitated by a memory cloud infrastructure called Trinity.

Key Features of the Approach

  1. Graph Exploration without Heavy Indexing: The method eschews traditional indices, using a lightweight string index mapping labels to node IDs. Graph exploration relies on the memory cloud's efficient data access capabilities to process queries by directly navigating through graph nodes and edges.
  2. Decomposition into STwigs: The algorithm breaks down the query into smaller units called STwigs, which are two-level tree structures. Each STwig is processed independently, and results are joined in an optimized manner to form complete solutions to the subgraph matching problem.
  3. Distributed and Parallel Processing: The algorithm's design supports execution across multiple machines in a cluster. By distributing data and workload, it significantly improves processing time and scales with the addition of more computational resources.
  4. Query Optimization Techniques: The work provides strategies for query decomposition and STwig ordering, and systematizes the selection of head STwigs and load sets to minimize communication overhead and make effective use of computational resources.
  5. Pipeline Join Strategy: To manage the size of intermediary results and maintain efficient memory usage, a pipelined strategy for join operations is implemented, allowing for partial result processing and reducing overall system latency.
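Item 2 above can be sketched in a few lines. In the paper's formulation, an STwig is a two-level tree given by a root label and a set of child labels; matching one reduces to scanning the label index for root candidates and combining label-matching neighbors. The code below is a simplified single-machine sketch with assumed data layouts (dicts rather than the Trinity memory cloud), not the paper's implementation.

```python
from itertools import product

def match_stwig(root_label, child_labels, label_index, adj, labels):
    """label_index: label -> list of node IDs (the lightweight string index).
    adj: node -> set of neighbor node IDs.  labels: node -> label.
    Yields tuples (root, child_1, ..., child_k) matching the STwig."""
    for root in label_index.get(root_label, []):
        # For each required child label, collect label-matching neighbors.
        buckets = []
        for cl in child_labels:
            candidates = [n for n in adj[root] if labels[n] == cl]
            if not candidates:
                break  # some child label has no match; skip this root
            buckets.append(candidates)
        else:
            # Emit every combination of one distinct child per label.
            for combo in product(*buckets):
                if len(set(combo)) == len(combo):
                    yield (root,) + combo
```

Because each STwig touches only a node and its direct neighbors, matching parallelizes naturally across the machines holding different node partitions.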
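The pipelined join of item 5 can likewise be illustrated. The idea is that results from the head STwig are consumed as a stream and joined against another STwig's matches as they arrive, so downstream operators start before the head's results are fully enumerated. The sketch below makes simplifying assumptions (one materialized side, partial matches as dicts from query nodes to data nodes) and is not the paper's distributed implementation.

```python
def pipeline_join(head_stream, other_matches, shared):
    """head_stream: iterator of partial matches from the head STwig.
    other_matches: list of matches from another STwig.
    shared: query nodes common to both STwigs.
    Lazily yields merged matches that agree on the shared nodes."""
    # Hash the materialized side on its shared query nodes once.
    index = {}
    for m in other_matches:
        index.setdefault(tuple(m[q] for q in shared), []).append(m)
    # Probe with head results as they stream in; no full materialization
    # of the joined result is needed.
    for h in head_stream:
        for m in index.get(tuple(h[q] for q in shared), []):
            merged = {**h, **m}
            # Keep only injective mappings (valid subgraph matches).
            if len(set(merged.values())) == len(merged):
                yield merged
```

Streaming the head side bounds the size of intermediate results held in memory, which is the point of the pipelined strategy.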

Experimental Evaluation and Performance

The authors validate their approach through extensive experimentation on both real and synthetic datasets, achieving efficient response times even on graphs containing billions of nodes. Their system significantly outperforms traditional methods that rely on heavy indexing schemes, demonstrating linear scalability with respect to graph size and density, while also showing effective parallel speedup.

Implications and Future Directions

Practically, this research opens avenues for scalable graph data processing in applications requiring real-time or near-real-time query response times, in fields such as web data analytics, bioinformatics, and network analysis. Theoretically, it contributes to the understanding of large graph processing, emphasizing the importance of lightweight indexing and distributed computing frameworks.

Future developments may focus on refining query optimization techniques, improving load balancing across clusters, and exploring integration with emerging hardware technologies to further enhance performance. This work lays a foundation for tackling even more complex graph operations at unprecedented scales, highlighting a shift towards memory-centric and distributed computation models in graph analytics.