- The paper presents a comprehensive review of techniques that map high-dimensional, sparse graphs into dense, low-dimensional vector spaces while preserving structural properties.
- It details random walk-based, Gaussian distribution-based, and dynamic embedding methods, emphasizing scalability and uncertainty quantification.
- The paper demonstrates practical applications across social networks, citation graphs, brain imaging, and genomic data, highlighting effectiveness in modern network analysis.
This paper provides a comprehensive review of graph embedding methods, focusing on their mathematical formulation, core techniques, and practical applications. It highlights the necessity of graph embedding to overcome the computational and memory challenges posed by traditional graph analytics on large, high-dimensional networks. The core idea of graph embedding is to map nodes from a high-dimensional, sparse graph into a low-dimensional, dense vector space while preserving structural properties, allowing for easier quantification of node similarity using standard metrics.
Key Concepts and Formulation:
- Graph Basics: Defines a graph G = (V, E) with node set V and edge set E, different graph types (directed/undirected, homogeneous/heterogeneous, weighted/unweighted), and representations (adjacency matrix, adjacency list, incidence matrix).
- Problem Formulation: Mathematically defines graph embedding as learning a mapping ϕ from nodes v_i to either low-dimensional vectors z_i ∈ ℝ^L or probability distributions P_i ∼ N(μ_i, Σ_i), where L ≪ |V|. The goal is to ensure that similarity in the embedded space (e.g., z_j^T z_i or μ_j^T μ_i) approximates the similarity Sim(v_i, v_j) in the original graph.
- Structure Preservation: Discusses preserving graph structure through proximity measures:
- First-order proximity: Similarity between directly connected nodes.
- Second-order proximity: Similarity between nodes' neighborhood structures.
- High-order proximity: More complex relationships captured by metrics like Rooted PageRank.
- Node Similarity Measures: Reviews methods to define node similarity in the original graph (a small sketch follows this list):
- Multi-hop neighborhood: Nodes reachable within k hops.
- Random walk-based: Probability of reaching one node from another via a random walk.
- Adjacency matrix-based: Based on direct connections in the matrix.
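These proximity and neighborhood notions reduce to simple operations on the adjacency structure. A minimal NumPy/NetworkX sketch, using the Karate club graph as a stand-in (the function names and the cosine form for second-order proximity are illustrative choices, not the paper's exact definitions):

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()   # stand-in graph
A = nx.to_numpy_array(G)     # adjacency matrix

def first_order(i, j):
    """First-order proximity: the direct-connection entry A[i, j]."""
    return A[i, j]

def second_order(i, j):
    """Second-order proximity: similarity of the two nodes' neighborhood
    rows (cosine used here as one common choice)."""
    ai, aj = A[i], A[j]
    return ai @ aj / (np.linalg.norm(ai) * np.linalg.norm(aj) + 1e-12)

def k_hop_neighborhood(G, v, k):
    """Multi-hop neighborhood: all nodes reachable from v within k hops."""
    return set(nx.single_source_shortest_path_length(G, v, cutoff=k)) - {v}
```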
- Embedded Space Similarity: Lists metrics for measuring similarity between embeddings (sketched in code after this list):
- Vector embeddings: Dot product, cosine similarity, Euclidean distance.
- Gaussian embeddings: Expected Likelihood (EL), KL divergence (D_KL), 2-Wasserstein distance (W_2).
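A minimal sketch of these embedded-space metrics, assuming diagonal covariances for the Gaussian case so the standard closed forms apply (variable names are illustrative; s2 denotes the diagonal of Σ):

```python
import numpy as np

# --- Vector (point) embeddings ---
def dot_similarity(z_i, z_j):
    return float(z_i @ z_j)

def cosine_similarity(z_i, z_j):
    return float(z_i @ z_j / (np.linalg.norm(z_i) * np.linalg.norm(z_j)))

def euclidean_distance(z_i, z_j):
    return float(np.linalg.norm(z_i - z_j))

# --- Gaussian embeddings N(mu, diag(s2)) ---
def expected_likelihood(mu_i, s2_i, mu_j, s2_j):
    """EL = N(mu_i; mu_j, Sigma_i + Sigma_j) for diagonal covariances."""
    var = s2_i + s2_j
    return float(np.exp(-0.5 * np.sum((mu_i - mu_j) ** 2 / var))
                 / np.sqrt(np.prod(2 * np.pi * var)))

def kl_divergence(mu_i, s2_i, mu_j, s2_j):
    """D_KL(N_i || N_j) for diagonal covariances."""
    return 0.5 * float(np.sum(s2_i / s2_j + (mu_j - mu_i) ** 2 / s2_j
                              - 1.0 + np.log(s2_j) - np.log(s2_i)))

def wasserstein2(mu_i, s2_i, mu_j, s2_j):
    """Squared 2-Wasserstein distance between diagonal Gaussians."""
    return float(np.sum((mu_i - mu_j) ** 2)
                 + np.sum((np.sqrt(s2_i) - np.sqrt(s2_j)) ** 2))
```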
Graph Embedding Methods:
- Vector Point-based Methods: Aim to embed nodes as single points in a latent space.
- Random Walk-based (DeepWalk, LINE, node2vec): These methods are highlighted for scalability. The pipeline (sketched in code after this list) involves:
- Generating node context (sequences) using random walks (e.g., DeepWalk's truncated walks, node2vec's biased walks controlled by parameters p and q to balance BFS/DFS exploration).
- Learning embeddings using an encoder (typically a SkipGram model) by optimizing an objective function (e.g., maximizing the likelihood of context nodes given a source node) using techniques like hierarchical softmax or negative sampling with SGD.
- Using the learned vectors (Φ ∈ ℝ^(|V|×L)) for downstream tasks.
- Limitations include the inability to capture network uncertainty.
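A compact sketch of this pipeline, assuming an unweighted NetworkX graph and Gensim's Word2Vec as the SkipGram encoder; the p/q biasing follows the node2vec rule described above, and all hyperparameter values are illustrative:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def biased_walk(G, start, length, p=1.0, q=1.0):
    """node2vec-style walk: weight 1/p to return to the previous node,
    1 to move to a mutual neighbor (BFS-like), 1/q to explore (DFS-like)."""
    walk = [start]
    while len(walk) < length:
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        weights = [1.0 / p if x == prev
                   else 1.0 if G.has_edge(x, prev)
                   else 1.0 / q for x in nbrs]
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return [str(n) for n in walk]       # Word2Vec expects string tokens

G = nx.gnp_random_graph(100, 0.05, seed=0)           # any homogeneous graph
walks = [biased_walk(G, v, length=20, p=1.0, q=0.5)  # node context generation
         for _ in range(10) for v in G.nodes()]

# SkipGram encoder (sg=1) trained with negative sampling and SGD
model = Word2Vec(walks, vector_size=16, window=5, sg=1,
                 negative=5, min_count=1, epochs=5)
Phi = model.wv   # Phi[str(v)] is the L-dimensional vector for node v
```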
- Gaussian Distribution-based Methods (KG2E, Graph2Gauss/G2G, DVNE): Embed nodes as multivariate Gaussian distributions N(μ_i, Σ_i), capturing uncertainty via the covariance Σ_i.
- Advantages: Model uncertainty, incorporate node attributes, represent nodes as "soft regions".
- Implementation (e.g., G2G): Generates node triplets (i, j_k, j_l) in which node i is closer to j_k than to j_l (based on k-hop neighborhoods); uses a deep encoder to map node attributes X to μ and Σ; defines an energy function (e.g., KL divergence) to measure the distance between distributions; and minimizes an energy-based ranking loss (e.g., the square-exponential loss) to enforce the proximity constraint E(P_i, P_{j_k}) < E(P_i, P_{j_l}). A sketch follows this list.
- Benefits: Quantifies neighborhood diversity, helps discover intrinsic graph dimensionality.
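A minimal sketch of the energy function and ranking loss, assuming diagonal covariances and NumPy in place of the paper's TensorFlow encoder; the energy direction D_KL(N_j ∥ N_i) and the square-exponential form follow the G2G description above, while everything else is illustrative:

```python
import numpy as np

def kl_energy(mu_i, s2_i, mu_j, s2_j):
    """Energy E(P_i, P_j) = D_KL(N_j || N_i) for diagonal covariances s2."""
    return 0.5 * float(np.sum(s2_j / s2_i + (mu_i - mu_j) ** 2 / s2_i
                              - 1.0 + np.log(s2_i) - np.log(s2_j)))

def square_exponential_loss(e_pos, e_neg):
    """Penalize the positive pair's energy quadratically and reward a large
    negative-pair energy exponentially, enforcing E_pos < E_neg."""
    return e_pos ** 2 + np.exp(-e_neg)

# For a triplet (i, j_k, j_l) where i is closer to j_k than to j_l:
# loss = square_exponential_loss(kl_energy(mu_i, s2_i, mu_jk, s2_jk),
#                                kl_energy(mu_i, s2_i, mu_jl, s2_jl))
```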
- Dynamic Graph Embedding Methods: Address graphs that evolve over time.
- Representations: Discrete snapshots (G_1, ..., G_T) or continuous-time graphs (temporal edges, link streams).
- Challenges: Ensuring embedding stability across time steps, efficiently updating embeddings, scaling to large dynamic networks.
- Approaches: Include methods based on matrix factorization, SkipGram, autoencoders, GCNs, and GANs.
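As one minimal illustration of the snapshot view (an assumption for this sketch, not a specific method from the paper), a SkipGram model can be warm-started from the previous snapshot so that consecutive embeddings stay aligned:

```python
import random
from gensim.models import Word2Vec

def uniform_walk(G, start, length):
    """Plain truncated random walk (DeepWalk-style), string tokens."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]

def embed_snapshots(snapshots, dim=16):
    """Embed G_1, ..., G_T, reusing the previous model's weights at each
    step to encourage embedding stability across time."""
    model, embeddings = None, []
    for G_t in snapshots:
        walks = [uniform_walk(G_t, v, 20) for _ in range(10)
                 for v in G_t.nodes()]
        if model is None:
            model = Word2Vec(walks, vector_size=dim, window=5, sg=1,
                             negative=5, min_count=1, epochs=5)
        else:
            model.build_vocab(walks, update=True)     # register new nodes
            model.train(walks, total_examples=len(walks),
                        epochs=model.epochs)          # warm-started update
        embeddings.append({n: model.wv[n].copy()
                           for n in model.wv.index_to_key})
    return embeddings
```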
Applications:
The paper showcases graph embedding applications across diverse domains:
- Social Networks: Community detection in Zachary's karate club using DeepWalk; analysis of large-scale attributed networks (Facebook, Twitter).
- Citation Networks: Node classification and analysis of uncertainty in CORA/CORA-ML datasets using Graph2Gauss, demonstrating how variance relates to neighborhood diversity and intrinsic dimensionality.
- Brain Networks: Using node2vec for predicting structure-function links; using multi-graph Gaussian embedding (MG2G) on fMRI data to quantify cognitive training effects in aMCI patients, leveraging uncertainty for patient-specific analysis and identifying affected regions/systems; predicting Alzheimer's progression using MEG data.
- Genomic Networks: Identifying chromatin sub-compartments from Hi-C data using LINE embedding and k-means (a generic embed-then-cluster sketch follows this list); analyzing gene sets using Gaussian embedding (set2Gauss) to model functional diversity and uncertainty within sets, based on protein-protein interaction networks.
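A generic version of the embed-then-cluster step used in such pipelines, assuming node embeddings are already stacked row-wise in a matrix Z (the cluster count k and all other values are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(Z, k, seed=0):
    """Group nodes by running k-means in the embedding space."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(Z)            # one cluster id per node
    return labels, km.cluster_centers_

# Random data standing in for real node embeddings:
Z = np.random.default_rng(0).normal(size=(200, 16))
labels, centers = cluster_embeddings(Z, k=5)
```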
Implementation and Summary:
- The paper emphasizes practical implementation, providing details in the appendix for node2vec on the Karate club dataset and Graph2Gauss on the CORA-ML dataset, mentioning libraries like NetworkX, Gensim, scikit-learn, and TensorFlow (a minimal Karate club sketch closes this summary).
- It concludes by positioning graph embedding as a powerful, scalable approach to modern network analysis, preferable to traditional graph analytics and to older dimensionality reduction techniques such as Isomap because of its lower computational complexity (often linear in graph size) and added capabilities such as uncertainty quantification. The review focuses on scalable random walk-based and neural network-based methods for both static and dynamic graphs.
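As a pointer to that appendix material, here is a minimal self-contained run on the Karate club graph: uniform walks plus a SkipGram encoder (DeepWalk-style for brevity; substitute the biased walker above for node2vec behavior), followed by a check that the embedding recovers the two known factions. Hyperparameters are illustrative:

```python
import random
import networkx as nx
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def walk(G, start, length):
    """Truncated uniform random walk returning string tokens."""
    path = [start]
    for _ in range(length - 1):
        path.append(random.choice(list(G.neighbors(path[-1]))))
    return [str(n) for n in path]

G = nx.karate_club_graph()
walks = [walk(G, v, 20) for _ in range(10) for v in G.nodes()]
model = Word2Vec(walks, vector_size=16, window=5, sg=1,
                 negative=5, min_count=1, epochs=5)

# Cluster the embeddings and compare with the known club split
Z = [model.wv[str(v)] for v in G.nodes()]
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
truth = [G.nodes[v]["club"] == "Mr. Hi" for v in G.nodes()]
agree = sum(int(p == t) for p, t in zip(pred, truth))
print(f"faction agreement: {max(agree, len(truth) - agree)}/{len(truth)}")
```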