- The paper presents a comprehensive review of techniques that map high-dimensional, sparse graphs into dense, low-dimensional vector spaces while preserving structural properties.
- It details random walk-based, Gaussian distribution-based, and dynamic embedding methods, emphasizing scalability and uncertainty quantification.
- The paper demonstrates practical applications across social networks, citation graphs, brain imaging, and genomic data, highlighting effectiveness in modern network analysis.
This paper provides a comprehensive review of graph embedding methods, focusing on their mathematical formulation, core techniques, and practical applications. It highlights the necessity of graph embedding to overcome the computational and memory challenges posed by traditional graph analytics on large, high-dimensional networks. The core idea of graph embedding is to map nodes from a high-dimensional, sparse graph into a low-dimensional, dense vector space while preserving structural properties, allowing for easier quantification of node similarity using standard metrics.
Key Concepts and Formulation:
- Graph Basics: Defines a graph G = (V, E) with node set V and edge set E, different graph types (directed/undirected, homogeneous/heterogeneous, weighted/unweighted), and representations (adjacency matrix, adjacency list, incidence matrix).
- Problem Formulation: Mathematically defines graph embedding as learning a mapping ϕ from nodes v_i to either low-dimensional vectors z_i ∈ ℝ^L or probability distributions P_i ∼ N(μ_i, Σ_i), where L ≪ |V|. The goal is to ensure that similarity in the embedded space (e.g., z_j^T z_i or μ_j^T μ_i) approximates the similarity Sim(v_i, v_j) in the original graph.
- Structure Preservation: Discusses preserving graph structure through proximity measures:
- First-order proximity: Similarity between directly connected nodes.
- Second-order proximity: Similarity between nodes' neighborhood structures.
- High-order proximity: More complex relationships captured by metrics like Rooted PageRank.
- Node Similarity Measures: Reviews methods to define node similarity in the original graph (a small sketch follows this list):
- Multi-hop neighborhood: Nodes reachable within k hops.
- Random walk-based: Probability of reaching one node from another via a random walk.
- Adjacency matrix-based: Based on direct connections in the matrix.
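These proximity and neighborhood notions reduce to simple operations on the adjacency structure. A minimal NumPy/NetworkX sketch, using the Karate club graph as a stand-in (the function names and the cosine form for second-order proximity are illustrative choices, not the paper's exact definitions):

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()   # stand-in graph
A = nx.to_numpy_array(G)     # adjacency matrix

def first_order(i, j):
    """First-order proximity: the direct-connection entry A[i, j]."""
    return A[i, j]

def second_order(i, j):
    """Second-order proximity: similarity of the two nodes' neighborhood
    rows (cosine used here as one common choice)."""
    ai, aj = A[i], A[j]
    return ai @ aj / (np.linalg.norm(ai) * np.linalg.norm(aj) + 1e-12)

def k_hop_neighborhood(G, v, k):
    """Multi-hop neighborhood: all nodes reachable from v within k hops."""
    return set(nx.single_source_shortest_path_length(G, v, cutoff=k)) - {v}
```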
- Embedded Space Similarity: Lists metrics for measuring similarity between embeddings (sketched in code after this list):
- Vector embeddings: Dot product, cosine similarity, Euclidean distance.
- Gaussian embeddings: Expected Likelihood (EL), KL divergence (D_KL), 2-Wasserstein distance (W_2).
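A minimal sketch of these embedded-space metrics, assuming diagonal covariances for the Gaussian case so the standard closed forms apply (variable names are illustrative; s2 denotes the diagonal of Σ):

```python
import numpy as np

# --- Vector (point) embeddings ---
def dot_similarity(z_i, z_j):
    return float(z_i @ z_j)

def cosine_similarity(z_i, z_j):
    return float(z_i @ z_j / (np.linalg.norm(z_i) * np.linalg.norm(z_j)))

def euclidean_distance(z_i, z_j):
    return float(np.linalg.norm(z_i - z_j))

# --- Gaussian embeddings N(mu, diag(s2)) ---
def expected_likelihood(mu_i, s2_i, mu_j, s2_j):
    """EL = N(mu_i; mu_j, Sigma_i + Sigma_j) for diagonal covariances."""
    var = s2_i + s2_j
    return float(np.exp(-0.5 * np.sum((mu_i - mu_j) ** 2 / var))
                 / np.sqrt(np.prod(2 * np.pi * var)))

def kl_divergence(mu_i, s2_i, mu_j, s2_j):
    """D_KL(N_i || N_j) for diagonal covariances."""
    return 0.5 * float(np.sum(s2_i / s2_j + (mu_j - mu_i) ** 2 / s2_j
                              - 1.0 + np.log(s2_j) - np.log(s2_i)))

def wasserstein2(mu_i, s2_i, mu_j, s2_j):
    """Squared 2-Wasserstein distance between diagonal Gaussians."""
    return float(np.sum((mu_i - mu_j) ** 2)
                 + np.sum((np.sqrt(s2_i) - np.sqrt(s2_j)) ** 2))
```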
Graph Embedding Methods:
- Vector Point-based Methods: Aim to embed nodes as single points in a latent space.
- Random Walk-based (DeepWalk, LINE, node2vec): These methods are highlighted for scalability. The pipeline (sketched in code after this list) involves:
- Generating node context (sequences) using random walks (e.g., DeepWalk's truncated walks, node2vec's biased walks controlled by parameters p and q to balance BFS/DFS exploration).
- Learning embeddings using an encoder (typically a SkipGram model) by optimizing an objective function (e.g., maximizing the likelihood of context nodes given a source node) using techniques like hierarchical softmax or negative sampling with SGD.
- Using the learned vectors (Φ ∈ ℝ^(|V|×L)) for downstream tasks.
- Limitations include the inability to capture network uncertainty.
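A compact sketch of this pipeline, assuming an unweighted NetworkX graph and Gensim's Word2Vec as the SkipGram encoder; the p/q biasing follows the node2vec rule described above, and all hyperparameter values are illustrative:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def biased_walk(G, start, length, p=1.0, q=1.0):
    """node2vec-style walk: weight 1/p to return to the previous node,
    1 to move to a mutual neighbor (BFS-like), 1/q to explore (DFS-like)."""
    walk = [start]
    while len(walk) < length:
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        weights = [1.0 / p if x == prev
                   else 1.0 if G.has_edge(x, prev)
                   else 1.0 / q for x in nbrs]
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return [str(n) for n in walk]       # Word2Vec expects string tokens

G = nx.gnp_random_graph(100, 0.05, seed=0)           # any homogeneous graph
walks = [biased_walk(G, v, length=20, p=1.0, q=0.5)  # node context generation
         for _ in range(10) for v in G.nodes()]

# SkipGram encoder (sg=1) trained with negative sampling and SGD
model = Word2Vec(walks, vector_size=16, window=5, sg=1,
                 negative=5, min_count=1, epochs=5)
Phi = model.wv   # Phi[str(v)] is the L-dimensional vector for node v
```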
- Gaussian Distribution-based Methods (KG2E, Graph2Gauss/G2G, DVNE): Embed nodes as multivariate Gaussian distributions N(μ_i, Σ_i), capturing uncertainty via the covariance Σ_i.
- Advantages: Model uncertainty, incorporate node attributes, represent nodes as "soft regions".
- Implementation (e.g., G2G): Generates node triplets (i, j_k, j_l) in which node i is closer to j_k than to j_l (based on k-hop neighborhoods); uses a deep encoder to map node attributes X to μ and Σ; defines an energy function (e.g., KL divergence) to measure the distance between distributions; and minimizes an energy-based ranking loss (e.g., the square-exponential loss) to enforce the proximity constraint E(P_i, P_{j_k}) < E(P_i, P_{j_l}). A sketch follows this list.
- Benefits: Quantifies neighborhood diversity, helps discover intrinsic graph dimensionality.
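A minimal sketch of the energy function and ranking loss, assuming diagonal covariances and NumPy in place of the paper's TensorFlow encoder; the energy direction D_KL(N_j ∥ N_i) and the square-exponential form follow the G2G description above, while everything else is illustrative:

```python
import numpy as np

def kl_energy(mu_i, s2_i, mu_j, s2_j):
    """Energy E(P_i, P_j) = D_KL(N_j || N_i) for diagonal covariances s2."""
    return 0.5 * float(np.sum(s2_j / s2_i + (mu_i - mu_j) ** 2 / s2_i
                              - 1.0 + np.log(s2_i) - np.log(s2_j)))

def square_exponential_loss(e_pos, e_neg):
    """Penalize the positive pair's energy quadratically and reward a large
    negative-pair energy exponentially, enforcing E_pos < E_neg."""
    return e_pos ** 2 + np.exp(-e_neg)

# For a triplet (i, j_k, j_l) where i is closer to j_k than to j_l:
# loss = square_exponential_loss(kl_energy(mu_i, s2_i, mu_jk, s2_jk),
#                                kl_energy(mu_i, s2_i, mu_jl, s2_jl))
```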
- Dynamic Graph Embedding Methods: Address graphs that evolve over time.
- Representations: Discrete snapshots (G_1, ..., G_T) or continuous-time graphs (temporal edges, link streams).
- Challenges: Ensuring embedding stability across time steps, efficiently updating embeddings, scaling to large dynamic networks.
- Approaches: Include methods based on matrix factorization, SkipGram, autoencoders, GCNs, and GANs.
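As one minimal illustration of the snapshot view (an assumption for this sketch, not a specific method from the paper), a SkipGram model can be warm-started from the previous snapshot so that consecutive embeddings stay aligned:

```python
import random
from gensim.models import Word2Vec

def uniform_walk(G, start, length):
    """Plain truncated random walk (DeepWalk-style), string tokens."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]

def embed_snapshots(snapshots, dim=16):
    """Embed G_1, ..., G_T, reusing the previous model's weights at each
    step to encourage embedding stability across time."""
    model, embeddings = None, []
    for G_t in snapshots:
        walks = [uniform_walk(G_t, v, 20) for _ in range(10)
                 for v in G_t.nodes()]
        if model is None:
            model = Word2Vec(walks, vector_size=dim, window=5, sg=1,
                             negative=5, min_count=1, epochs=5)
        else:
            model.build_vocab(walks, update=True)     # register new nodes
            model.train(walks, total_examples=len(walks),
                        epochs=model.epochs)          # warm-started update
        embeddings.append({n: model.wv[n].copy()
                           for n in model.wv.index_to_key})
    return embeddings
```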
Applications:
The paper showcases graph embedding applications across diverse domains:
- Social Networks: Community detection in Zachary's karate club using DeepWalk; analysis of large-scale attributed networks (Facebook, Twitter).
- Citation Networks: Node classification and analysis of uncertainty in CORA/CORA-ML datasets using Graph2Gauss, demonstrating how variance relates to neighborhood diversity and intrinsic dimensionality.
- Brain Networks: Using node2vec for predicting structure-function links; using multi-graph Gaussian embedding (MG2G) on fMRI data to quantify cognitive training effects in aMCI patients, leveraging uncertainty for patient-specific analysis and identifying affected regions/systems; predicting Alzheimer's progression using MEG data.
- Genomic Networks: Identifying chromatin sub-compartments from Hi-C data using LINE embedding and k-means (a generic embed-then-cluster sketch follows this list); analyzing gene sets using Gaussian embedding (set2Gauss) to model functional diversity and uncertainty within sets, based on protein-protein interaction networks.
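A generic version of the embed-then-cluster step used in such pipelines, assuming node embeddings are already stacked row-wise in a matrix Z (the cluster count k and all other values are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(Z, k, seed=0):
    """Group nodes by running k-means in the embedding space."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(Z)            # one cluster id per node
    return labels, km.cluster_centers_

# Random data standing in for real node embeddings:
Z = np.random.default_rng(0).normal(size=(200, 16))
labels, centers = cluster_embeddings(Z, k=5)
```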
Implementation and Summary:
- The paper emphasizes practical implementation, providing details in the appendix for node2vec on the Karate club dataset and Graph2Gauss on the CORA-ML dataset, mentioning libraries like NetworkX, Gensim, scikit-learn, and TensorFlow (a minimal Karate club sketch closes this summary).
- It concludes by positioning graph embedding as a powerful, scalable approach to modern network analysis, preferable to traditional graph analytics and to older dimensionality reduction techniques such as Isomap because of its lower computational complexity (often linear in graph size) and added capabilities such as uncertainty quantification. The review focuses on scalable random walk-based and neural network-based methods for both static and dynamic graphs.
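As a pointer to that appendix material, here is a minimal self-contained run on the Karate club graph: uniform walks plus a SkipGram encoder (DeepWalk-style for brevity; substitute the biased walker above for node2vec behavior), followed by a check that the embedding recovers the two known factions. Hyperparameters are illustrative:

```python
import random
import networkx as nx
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def walk(G, start, length):
    """Truncated uniform random walk returning string tokens."""
    path = [start]
    for _ in range(length - 1):
        path.append(random.choice(list(G.neighbors(path[-1]))))
    return [str(n) for n in path]

G = nx.karate_club_graph()
walks = [walk(G, v, 20) for _ in range(10) for v in G.nodes()]
model = Word2Vec(walks, vector_size=16, window=5, sg=1,
                 negative=5, min_count=1, epochs=5)

# Cluster the embeddings and compare with the known club split
Z = [model.wv[str(v)] for v in G.nodes()]
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
truth = [G.nodes[v]["club"] == "Mr. Hi" for v in G.nodes()]
agree = sum(int(p == t) for p, t in zip(pred, truth))
print(f"faction agreement: {max(agree, len(truth) - agree)}/{len(truth)}")
```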