Distributed Matrix-Based Sampling for Graph Neural Network Training (2311.02909v3)
Abstract: Graph Neural Networks (GNNs) offer a compact and computationally efficient way to learn embeddings and classifications on graph data. GNN models are frequently large, making distributed minibatch training necessary. The primary contribution of this paper is new methods for reducing communication in the sampling step of distributed GNN training. Here, we propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix-matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and uses communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on graphs far larger than those that fit in a single device's memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms simply by using different matrix constructions. In addition to new sampling methods, we introduce a pipeline that uses our matrix-based bulk sampling approach to provide end-to-end training results. We provide experimental results on the largest Open Graph Benchmark (OGB) datasets on $128$ GPUs, and show that our pipeline is $2.5\times$ faster than Quiver (a distributed extension to PyTorch Geometric) on a $3$-layer GraphSAGE network. On datasets outside of OGB, we show an $8.46\times$ speedup in per-epoch time on $128$ GPUs. Finally, we show scaling when the graph is distributed across GPUs, and scaling for both node-wise and layer-wise sampling algorithms.
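To make the central idea concrete, below is a minimal single-device sketch (in Python with SciPy, which the paper does not prescribe; the function and variable names are illustrative, not the paper's distributed implementation) of one layer of node-wise, GraphSAGE-style neighbor sampling expressed as an SpGEMM followed by a row-wise selection:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

def sample_layer(adj, batch, fanout):
    """One node-wise sampling step as SpGEMM + row-wise selection.

    Q @ A gathers the neighbor lists of every batch vertex in a
    single sparse matrix product; we then keep at most `fanout`
    nonzeros per row of the result.
    """
    n = adj.shape[0]
    b = len(batch)
    # Q is a b x n selection matrix with Q[i, batch[i]] = 1.
    q = sp.csr_matrix((np.ones(b), (np.arange(b), batch)), shape=(b, n))
    frontier = (q @ adj).tocsr()  # SpGEMM: row i = neighbors of batch[i]
    rows, cols = [], []
    for i in range(b):
        nbrs = frontier.indices[frontier.indptr[i]:frontier.indptr[i + 1]]
        if len(nbrs) == 0:
            continue
        keep = rng.choice(nbrs, size=min(fanout, len(nbrs)), replace=False)
        rows.extend([i] * len(keep))
        cols.extend(keep)
    # Return the sampled adjacency as a sparse matrix, so the next
    # layer's sampling step is again an SpGEMM.
    return sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(b, n))

# Toy usage: a random sparse graph and one 32-vertex minibatch.
adj = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
batch = rng.choice(1000, size=32, replace=False)
print(sample_layer(adj, batch, fanout=10).nnz)
```

In this framing, bulk sampling amortizes per-minibatch overheads by stacking the selection matrices of many minibatches into one taller $Q$ and issuing a single SpGEMM, and, as the abstract notes, different constructions of the sampling matrix yield different sampling algorithms (e.g., node-wise versus layer-wise).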
- HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Research, 46(6):e33–e33, 2018. doi: 10.1093/nar/gkx1313. URL https://doi.org/10.1093/nar/gkx1313.
- Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’13), pp. 222–231, 2013.
- The Combinatorial BLAS: Design, implementation, and applications. The International Journal of High Performance Computing Applications, 25(4):496–509, 2011.
- Challenges and advances in parallel sparse matrix-matrix multiplication. In The 37th International Conference on Parallel Processing (ICPP’08), pp. 503–510, Portland, Oregon, USA, September 2008. doi: 10.1109/ICPP.2008.45. URL http://eecs.berkeley.edu/~aydin/Buluc-ParallelMatMat.pdf.
- DSP: Efficient GNN training with multiple GPUs. In PPoPP ’23, pp. 392–404, 2023.
- Communication-free distributed GNN training with vertex cut, 2023.
- FastGCN: Fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.
- NVIDIA Corporation. NCCL: Optimized primitives for collective multi-GPU communication. https://github.com/NVIDIA/nccl, 2023.
- Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
- P3: Distributed deep graph learning at scale. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pp. 551–568. USENIX Association, July 2021. ISBN 978-1-939133-22-9. URL https://www.usenix.org/conference/osdi21/presentation/gandhi.
- Integrated model, batch, and domain parallelism in training neural networks. In SPAA’18: 30th ACM Symposium on Parallelism in Algorithms and Architectures, 2018.
- Inductive representation learning on large graphs. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 1024–1034. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs.pdf.
- Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
- Parallel weighted random sampling. ACM Transactions on Mathematical Software, 48:1–40, 2022.
- Accelerating graph sampling for graph machine learning using GPUs. In EuroSys ’21: Proceedings of the Sixteenth European Conference on Computer Systems, pp. 311–326. ACM, 2021.
- Improving the accuracy, scalability, and performance of graph neural networks with ROC. In Proceedings of Machine Learning and Systems (MLSys), pp. 187–198. 2020.
- PGLBox: Multi-GPU graph learning framework for web-scale recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, pp. 4262–4272, 2023.
- Communication-efficient graph neural networks with probabilistic neighborhood expansion analysis and caching. In Proceedings of Machine Learning and Systems, 2023.
- Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In Proceedings of the IPDPS, 2016.
- NeuGraph: Parallel deep neural network computation on large graphs. In USENIX Annual Technical Conference (USENIX ATC 19), pp. 443–458, Renton, WA, 2019. USENIX Association. ISBN 978-1-939133-03-8.
- Adaptive multi-level blocking optimization for sparse matrix-vector multiplication on GPU. Procedia Computer Science, 80(C):131–142, 2016. ISSN 1877-0509. doi: 10.1016/j.procs.2016.05.304. URL https://doi.org/10.1016/j.procs.2016.05.304.
- Fast inverse transform sampling in one and two dimensions. arXiv preprint arXiv:1307.1223, 2013.
- C-SAW: A framework for graph sampling and random walk on GPUs. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14. IEEE, 2020.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
- Quiver Team. Torch-Quiver: PyTorch library for fast and easy distributed graph learning. https://github.com/quiver-team/torch-quiver, 2023.
- Reducing communication in graph neural network training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14. IEEE, 2020.
- BNS-GCN: Efficient full-graph training of graph convolutional networks with partition-parallelism and random boundary node sampling. Proceedings of Machine Learning and Systems, 4:673–693, 2022a.
- PipeGCN: Efficient full-graph training of graph convolutional networks with pipelined feature communication. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=kSwqMH0zn1F.
- A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.
- GraphBLAST: A high-performance linear algebra-based graph framework on the GPU. ACM Transactions on Mathematical Software, 48(1):1–51, 2022a.
- WholeGraph: A fast graph neural network training framework with multi-GPU distributed shared memory architecture. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’22, 2022b.
- GNNLab: A factored system for sample-based GNN training over GPUs. In Proceedings of the Seventeenth European Conference on Computer Systems, EuroSys ’22, pp. 417–434, 2022c.
- KnightKing: A fast distributed graph random walk engine. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP ’19), 2019.
- DistDGL: Distributed graph neural network training for billion-scale graphs. In 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pp. 36–44. IEEE, 2020.
- AliGraph: a comprehensive graph neural network platform. Proceedings of the VLDB Endowment, 12(12):2094–2105, 2019.
- Layer-dependent importance sampling for training deep and large graph convolutional networks. In Proceedings of Neural Information Processing Systems (NeurIPS), 2019.
- Alok Tripathy
- Katherine Yelick
- Aydin Buluc