- The paper provides a comprehensive review of graph embedding by categorizing input variations, output representations, and key methodological techniques.
- It examines multiple methods—including matrix factorization, deep learning, and generative models—highlighting their trade-offs in capturing global and local graph structures.
- The analysis illustrates practical applications in node classification, link prediction, and whole-graph analysis while suggesting future research directions for scalability and dynamic graphs.
A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications
Graph embedding offers a potent solution to the computational challenges inherent in traditional graph analytics by converting graph data into a low-dimensional space. In this survey by Hongyun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang, the authors examine the field's current landscape, focusing on three primary facets: problem settings, techniques, and applications.
Problem Settings: Input and Output Diversity
Graph Embedding Input
The first categorization presented by the authors deals with the input variations in graph embedding. The input may include:
- Homogeneous Graphs: These consist of nodes and edges of a single type; the edges may be directed or undirected, and weighted or unweighted. The key challenge here is capturing the diversity of connectivity patterns within the graph's structure.
- Heterogeneous Graphs: These encompass multiple node and edge types, prevalent in community-based Question Answering (cQA) sites, multimedia networks, and knowledge graphs. Handling the global consistency among different entity types and addressing data imbalances are vital challenges in this context.
- Graphs with Auxiliary Information: These graphs include additional node or edge attributes (e.g., labels, features, or knowledge base content). Incorporating such rich and unstructured information into the embedding process, while maintaining structural integrity, defines the primary challenge.
- Constructed Graphs from Non-relational Data: Here, graphs are constructed using non-relational data inputs, where nodes and edges are generated based on various similarity measures or co-occurrence patterns. This involves ensuring that constructed relations preserve the underlying data's proximity measures effectively.
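A common similarity-based construction is the k-nearest-neighbour rule. The following minimal sketch (the feature vectors and choice of k are illustrative assumptions, not taken from the survey) builds a symmetric adjacency matrix from non-relational feature data:

```python
import numpy as np

def knn_graph(features, k=2):
    """Build a symmetric k-nearest-neighbour adjacency matrix
    from row-wise feature vectors, using Euclidean distance."""
    n = len(features)
    # Pairwise Euclidean distances via broadcasting.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    adj = np.zeros((n, n))
    for i in range(n):
        # Indices of the k closest other points (position 0 is the point itself).
        nbrs = np.argsort(dist[i])[1:k + 1]
        adj[i, nbrs] = 1
    # Symmetrise: keep an edge if either endpoint selected the other.
    return np.maximum(adj, adj.T)

# Two well-separated clusters of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
A = knn_graph(X, k=1)
```

With k=1 each point links only to its nearest neighbour, so edges appear within clusters but not across them; the resulting graph's proximity structure mirrors the proximity of the underlying feature data.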
Graph Embedding Output
The output in graph embedding scenarios is another primary axis of categorization, including:
- Node Embedding: Typically, each node is represented as a vector, preserving the node's similarity to its neighbors.
- Edge Embedding: In contrast to node embedding, edge embedding represents each node pair as a vector, capturing the relationship between the two nodes. Knowledge graph applications often rely on this type, where the embedding must preserve the asymmetry of directed edges.
- Hybrid Embedding: This encompasses embedding substructures, such as node pairs or communities, and demands resolving the heterogeneity of embedded components.
- Whole-Graph Embedding: The entire graph is represented as a single vector, facilitating graph-level similarity search and classification.
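The asymmetry requirement for knowledge-graph edges can be illustrated with a translational scoring function in the style of TransE (one representative model, not the survey's only option); the entities, relation, and 2-dimensional vectors below are hypothetical toy values:

```python
import numpy as np

# Toy 2-D entity and relation embeddings (hypothetical values).
entity = {"paris": np.array([1.0, 0.0]),
          "france": np.array([2.0, 1.0])}
relation = {"capital_of": np.array([1.0, 1.0])}

def score(head, rel, tail):
    """Translational plausibility in the style of TransE:
    a small ||h + r - t|| means the directed triple is likely true."""
    return -np.linalg.norm(entity[head] + relation[rel] - entity[tail])

# The translation h + r is direction-sensitive, so reversing the edge
# changes the score -- preserving the asymmetry of directed edges.
forward = score("paris", "capital_of", "france")
backward = score("france", "capital_of", "paris")
```

Because the score depends on the direction of translation, (paris, capital_of, france) ranks higher than its reversal, which a symmetric similarity measure could not distinguish.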
Embedding Techniques
The authors categorize the techniques into five principal methods:
- Matrix Factorization: This method factorizes a matrix that encodes node proximity (e.g., the graph Laplacian in Laplacian eigenmaps, or higher-order proximity matrices) to obtain low-dimensional representations. While effective at capturing global proximities, it often suffers from the high cost of constructing and factorizing large matrices.
- Deep Learning: Various deep learning models, with or without random walks, are employed to capture graph characteristics. Despite their robust performance and ability to automate feature identification, deep learning approaches can become computationally intensive.
- Edge Reconstruction Based Optimization: This approach directly optimizes an edge-based objective, either maximizing edge reconstruction probability or minimizing a distance-based or margin-based ranking loss. It is efficient, but tends to preserve local rather than global graph structure.
- Graph Kernel: The graph is decomposed into atomic substructures such as graphlets, subtree patterns, or random walks, and represented as a vector of substructure counts. While efficient, the representation can become high-dimensional and sparse as the number of distinct substructures grows.
- Generative Models: Embedding is viewed through the lens of semantic spaces, allowing the integration of node features through latent variable models. Challenges include choosing appropriate distributions and requiring substantial training data.
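As one concrete instance of the matrix factorization family, here is a minimal Laplacian-eigenmaps sketch: nodes are embedded using the eigenvectors of the graph Laplacian with the smallest non-zero eigenvalues, so that connected nodes land close together. The toy graph (two triangles joined by a bridge) and the 1-dimensional output are illustrative assumptions:

```python
import numpy as np

def laplacian_eigenmaps(adj, dim=2):
    """Embed nodes via the eigenvectors of the unnormalized graph
    Laplacian L = D - W with the smallest non-zero eigenvalues."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    vals, vecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue ~ 0).
    return vecs[:, 1:dim + 1]

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

emb = laplacian_eigenmaps(A, dim=1)
```

The first non-trivial eigenvector (the Fiedler vector) assigns the two triangles values of opposite sign, so nodes in the same community sit much closer in the embedding than nodes across the bridge, illustrating how the factorization captures global proximity.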
Applications
Graph embedding facilitates numerous applications:
- Node-Related: Embeddings are useful for node classification (e.g., using SVM or logistic regression), clustering, recommendation, retrieval, and ranking.
- Edge-Related: Common applications include link prediction and triplet classification within knowledge graphs.
- Whole-Graph: Tasks such as graph classification leverage entire graph embeddings for better performance and efficiency.
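In practice, link prediction from node embeddings often reduces to scoring and ranking candidate node pairs. A minimal sketch, assuming hypothetical pretrained embeddings and a simple inner-product score (one common choice among several):

```python
import numpy as np

# Hypothetical learned node embeddings (4 nodes, 2 dimensions).
emb = np.array([[1.0, 0.1],
                [0.9, 0.2],
                [0.1, 1.0],
                [0.0, 0.9]])

def link_score(u, v):
    """Inner-product score: a higher value suggests that an
    unobserved edge between u and v is more plausible."""
    return float(emb[u] @ emb[v])

# Nodes 0 and 1 have similar embeddings, so the candidate edge (0, 1)
# outranks (0, 3) when candidates are sorted by score.
ranked = sorted([(0, 1), (0, 3), (1, 2)], key=lambda e: -link_score(*e))
```

The same embeddings can feed node-level tasks: the vectors serve directly as feature inputs to an off-the-shelf classifier such as SVM or logistic regression, as the survey notes.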
Future Directions
The survey suggests several future directions:
- Computation Efficiency: Addressing the computational inefficiency inherent to deep learning models, particularly for large-scale graphs.
- Dynamic Graph Embedding: Developing methods for graphs whose structure or informational content evolves over time.
- Techniques: Incorporating more comprehensive structural awareness and finding more scalable optimization methods.
- Expansion of Applications: Exploring a wider array of scenarios where graph embedding can provide novel insights and efficiency improvements.
Conclusion
This comprehensive survey synthesizes over a decade of research and development in graph embedding. By categorizing existing methods and identifying challenges and directions for future research, it provides a valuable reference point for academics and practitioners working within this domain.