GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation (2001.08184v2)

Published 22 Jan 2020 in cs.LG and stat.ML

Abstract: Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at https://github.com/idea-iitd/graphgen.

Summary of "GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation"

The paper "GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation" addresses key challenges in graph generative modeling, focusing on domain-agnostic, scalable techniques for labeled graphs. The authors represent each graph by its minimum depth-first search (DFS) code, which serves as a canonical label and converts the graph into a sequence. These sequences are then modeled with a recurrent neural network, specifically a Long Short-Term Memory (LSTM) architecture. This formulation captures the joint distribution of graph structure and semantic labels without domain-specific rules or assumptions.
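To make the graph-to-sequence idea concrete, the sketch below emits one DFS code for a small labeled graph as a list of 5-tuples (discovery index of the source, discovery index of the target, source label, edge label, target label). It is only an illustration: the graph, the labels, and the helper name `dfs_code` are hypothetical, neighbors are visited in sorted order for determinism, and the code returned is just one of many possible DFS codes, not the minimum one. The minimum DFS code is the smallest of all such codes under a dedicated lexicographic order, which this sketch does not compute.

```python
from typing import Dict, List, Optional, Tuple

# One entry of a DFS code: (t_u, t_v, L_u, L_e, L_v).
EdgeTuple = Tuple[int, int, str, str, str]


def dfs_code(node_labels: Dict[str, str],
             edge_labels: Dict[Tuple[str, str], str],
             start: str) -> List[EdgeTuple]:
    """Return ONE DFS code of a connected, simple, undirected labeled graph."""
    # Symmetric adjacency map: node -> {neighbor: edge label}.
    adj: Dict[str, Dict[str, str]] = {v: {} for v in node_labels}
    for (u, v), lab in edge_labels.items():
        adj[u][v] = lab
        adj[v][u] = lab

    stamp: Dict[str, int] = {}      # discovery index of each visited node
    code: List[EdgeTuple] = []

    def visit(v: str, parent: Optional[str]) -> None:
        stamp[v] = len(stamp)
        # Backward edges: to already-visited nodes other than the parent,
        # emitted right after v is discovered, ordered by target index.
        back = [u for u in adj[v] if u in stamp and u != parent and u != v]
        for u in sorted(back, key=lambda n: stamp[n]):
            code.append((stamp[v], stamp[u],
                         node_labels[v], adj[v][u], node_labels[u]))
        # Forward edges: to nodes that are still undiscovered.
        for w in sorted(adj[v]):
            if w not in stamp:
                code.append((stamp[v], len(stamp),
                             node_labels[v], adj[v][w], node_labels[w]))
                visit(w, v)

    visit(start, None)
    return code


# Hypothetical toy graph: a labeled triangle a-b-c with a tail c-d.
NODE_LABELS = {"a": "C", "b": "C", "c": "O", "d": "N"}
EDGE_LABELS = {("a", "b"): "single", ("b", "c"): "single",
               ("c", "a"): "double", ("c", "d"): "single"}

code_from_a = dfs_code(NODE_LABELS, EDGE_LABELS, "a")
code_from_d = dfs_code(NODE_LABELS, EDGE_LABELS, "d")
for edge in code_from_a:
    print(edge)
```

Starting the traversal from a different node (`code_from_d`) yields a different sequence for the same graph, which is the redundancy that choosing a single canonical code is meant to remove.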

Key Contributions and Methodology

Canonization via DFS Codes: The paper uses minimum DFS codes as canonical labels so that each graph has a unique sequence representation. With adjacency matrices or arbitrary node orderings, a single graph can correspond to many different sequences; a canonical code removes this redundancy, which makes training more efficient.
Novel Use of LSTM: The authors design an LSTM-based architecture for the DFS code sequences, positing that these canonical sequences are more amenable to deep sequence modeling. The architecture consists of a state transition function, an embedding function, and multiple output functions, each of which generates one component of a graph's edge tuple independently of the others.
Scalability and Robustness: GraphGen improves on existing methods such as GraphRNN and DeepGMG in both efficiency and quality across datasets spanning chemical compounds, citation networks, and protein structures. According to the abstract, it is 4 times faster on average than state-of-the-art techniques, and it scales better with dataset size and graph size.
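The independent output functions described above can be illustrated with a small pure-Python sketch. It assumes the edge tuple has five components (two discovery indices and three labels) and that each component gets its own softmax head over a shared hidden state, so the probability of a tuple is the product of five categorical probabilities. The head sizes, the random weights, and the fixed vector standing in for the LSTM hidden state are all hypothetical; this is not the authors' implementation.

```python
import math
import random
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: List[List[float]], h: List[float]) -> List[float]:
    """Matrix-vector product: one logit per row of `weights`."""
    return [sum(w * x for w, x in zip(row, h)) for row in weights]


# Hypothetical vocabulary sizes for the five edge-tuple components.
HEAD_SIZES = {"t_u": 5, "t_v": 5, "L_u": 3, "L_e": 2, "L_v": 3}
HIDDEN = 4

# Random weights stand in for trained output functions (assumption).
rng = random.Random(0)
heads = {
    name: [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)] for _ in range(size)]
    for name, size in HEAD_SIZES.items()
}


def component_distributions(h: List[float]) -> Dict[str, List[float]]:
    """One categorical distribution per component, all from the same state."""
    return {name: softmax(linear(w, h)) for name, w in heads.items()}


def tuple_log_prob(h: List[float], edge_tuple: Dict[str, int]) -> float:
    """Log-probability of an edge tuple under the independence assumption:
    the sum of the per-component log-probabilities."""
    dists = component_distributions(h)
    return sum(math.log(dists[name][idx]) for name, idx in edge_tuple.items())


h = [0.1, -0.3, 0.7, 0.2]  # stand-in for an LSTM hidden state
example = {"t_u": 0, "t_v": 1, "L_u": 2, "L_e": 0, "L_v": 1}
dists = component_distributions(h)
log_p = tuple_log_prob(h, example)
print(log_p)
```

Summing the per-component negative log-probabilities over all tuples of a sequence gives a training loss of the usual cross-entropy form; in a full model the hidden state would be updated by the LSTM after each emitted tuple.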

Numerical Results and Evaluation

The experiments on real datasets underscore GraphGen's ability to generate realistic graphs. On average, GraphGen performs significantly better than existing techniques across the reported quality metrics (11 in total, per the abstract), including node degree distribution, clustering coefficient distribution, orbit counts, and an NSPDK-based distance that measures holistic similarity between generated and test graphs. GraphGen also maintains high diversity in the generated graphs, as indicated by its novelty and uniqueness scores.
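As a rough illustration of the two diversity metrics, the sketch below computes uniqueness and novelty over graphs that have already been reduced to hashable canonical keys (for example, a canonical code stored as a tuple or string). Exact key matching is a simplifying assumption made here; the paper's own definitions may use a different matching criterion, and the function names and toy data are hypothetical.

```python
from typing import Hashable, Sequence


def uniqueness(generated: Sequence[Hashable]) -> float:
    """Fraction of generated graphs that are distinct from one another."""
    return len(set(generated)) / len(generated)


def novelty(generated: Sequence[Hashable], training: Sequence[Hashable]) -> float:
    """Fraction of generated graphs that do not appear in the training set."""
    seen = set(training)
    return sum(1 for g in generated if g not in seen) / len(generated)


# Toy canonical keys standing in for real graphs (assumption).
generated_keys = ["g1", "g2", "g2", "g3"]
training_keys = ["g1", "g9"]

u = uniqueness(generated_keys)
n = novelty(generated_keys, training_keys)
print(u, n)
```

A model that memorizes its training set would score low on novelty, while one that keeps emitting the same graph would score low on uniqueness, which is why both are reported alongside the distributional metrics.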

Theoretical and Practical Implications

DFS codes offer a compelling alternative for graph modeling: they encode both structure and labels in a single, precise sequence representation. The approach could benefit fields that rely on graph analysis, such as chemistry (molecular generation), bioinformatics (protein interaction networks), and the social sciences (opinion or influence modeling in networks).

On the theoretical side, the method circumvents the need for domain-specific assumptions, thus making it broadly applicable. The paper sets the stage for future investigations into even more ambitious graph modeling tasks including those involving features rather than discrete labels.

Future Directions

As suggested by the results and discussion in the paper, this research could progress towards handling unlabeled graphs or graphs with continuous features rather than categorical labels. Scalability to very large graphs, particularly those with millions of nodes, also remains open; it raises computational challenges that current state-of-the-art techniques do not address.

In essence, the work on GraphGen exemplifies an innovative stride in domain-agnostic, scalable labeled graph generation, paving the way for more universal application of graph generative models across diverse data types and domains.

Authors (3)

Nikhil Goyal (1 paper)
Harsh Vardhan Jain (2 papers)
Sayan Ranu (41 papers)

Citations (86)

GitHub

GitHub - idea-iitd/graphgen: GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation (49 stars)