Distributed Graph Embedding with Information-Oriented Random Walks (2303.15702v2)
Abstract: Graph embedding maps graph nodes to low-dimensional vectors and is widely adopted in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, for example for link prediction on the Twitter graph, which has over one billion edges. Most existing graph embedding methods do not scale to data of this size. In this paper, we present DistGER, a general-purpose, distributed, information-centric, random walk-based graph embedding framework that scales to billion-edge graphs. DistGER incrementally computes information-centric random walks. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy that simultaneously achieves high local partition quality and excellent workload balancing across machines. DistGER also improves the distributed Skip-Gram learning model used to generate node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency. Experiments on real-world graphs demonstrate that, compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, and PyTorch-BigGraph, DistGER achieves 2.33x-129x acceleration, a 45% reduction in cross-machine communication, and more than 10% effectiveness improvement in downstream tasks.
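To make the trade-off between local partition quality and workload balance concrete, the following is a minimal sketch of a one-pass streaming partitioner. It is not DistGER's multi-proximity-aware, parallel strategy, which the abstract does not specify; it instead uses the simpler linear deterministic greedy (LDG) heuristic of Stanton and Kliot as a stand-in. The names `ldg_partition`, `edge_cut`, and `slack`, and the toy graph, are illustrative assumptions.

```python
import math


def ldg_partition(adjacency, num_parts, slack=1.1):
    """Assign each vertex to a partition in a single streaming pass.

    LDG heuristic (an illustrative stand-in, not DistGER's partitioner):
    prefer the partition that already holds many neighbors of the vertex,
    discounted by how full that partition is.
    """
    # Hard per-partition capacity; slack > 1 tolerates some imbalance.
    capacity = math.ceil(slack * len(adjacency) / num_parts)
    assignment = {}
    sizes = [0] * num_parts

    for v, neighbors in adjacency.items():
        # Count neighbors of v that were already placed, per partition.
        placed = [0] * num_parts
        for u in neighbors:
            if u in assignment:
                placed[assignment[u]] += 1

        # Only partitions with free capacity are candidates.
        candidates = [p for p in range(num_parts) if sizes[p] < capacity]

        def score(p):
            # Locality term times balance penalty; ties go to the emptier partition.
            return (placed[p] * (1.0 - sizes[p] / capacity), -sizes[p])

        best = max(candidates, key=score)
        assignment[v] = best
        sizes[best] += 1

    return assignment, sizes


def edge_cut(adjacency, assignment):
    """Number of undirected edges whose endpoints lie in different partitions."""
    crossing = sum(
        1
        for v, neighbors in adjacency.items()
        for u in neighbors
        if assignment[v] != assignment[u]
    )
    return crossing // 2  # each undirected edge was counted from both ends


# Toy graph: two 4-cliques joined by the single edge (3, 4).
toy_graph = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
toy_assignment, toy_sizes = ldg_partition(toy_graph, num_parts=2, slack=1.0)
print(toy_sizes, edge_cut(toy_graph, toy_assignment))
```

A lower edge cut means fewer random-walk steps cross machine boundaries, while the capacity bound keeps per-machine load comparable. The `slack` parameter controls how much balance is traded for locality.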
- Our code and datasets. https://github.com/RocmFang/DistGER.