HongTu: Scalable Full-Graph GNN Training on Multiple GPUs (via communication-optimized CPU data offloading) (2311.14898v1)
Abstract: Full-graph training of graph neural networks (GNNs) has emerged as a promising training method for its effectiveness, but it demands extensive memory and computation resources. To accelerate this training process, researchers have proposed employing multi-GPU processing. However, the scalability of existing frameworks is limited because they must keep the training data of every layer in GPU memory. To efficiently train on large graphs, we present HongTu, a scalable full-graph GNN training system running on GPU-accelerated platforms. HongTu stores vertex data in CPU memory and offloads training to GPUs. HongTu employs a memory-efficient full-graph training framework that reduces runtime memory consumption through partition-based training and recomputation-caching-hybrid management of intermediate data. To address the increased host-GPU communication caused by duplicated neighbor access across partitions, HongTu employs a deduplicated communication framework that converts redundant host-GPU communication into efficient inter/intra-GPU data access. Further, HongTu uses a cost-model-guided graph reorganization method to minimize communication overhead. Experimental results on a 4×A100 GPU server show that HongTu effectively supports billion-scale full-graph GNN training while reducing host-GPU data communication by 25%-71%. Compared to the full-graph GNN system DistGNN running on 16 CPU nodes, HongTu achieves speedups ranging from 7.8× to 20.2×. For small graphs whose training data fits into the GPUs, HongTu achieves performance comparable to existing GPU-based GNN systems.
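To make the partition-based training and recomputation-caching-hybrid ideas concrete, below is a minimal PyTorch sketch under our own assumptions (a toy two-layer model, a toy loss, and an illustrative `RECOMPUTE` set are all inventions here); it is not HongTu's implementation, and it omits graph aggregation, deduplicated transfers, and multi-GPU coordination entirely.

```python
# Minimal sketch (NOT HongTu's code) of two ideas from the abstract:
# partition-based training with host-resident vertex data, and a
# recomputation/caching hybrid for intermediate activations.
import torch
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
N, D, P = 8192, 64, 4                 # vertices, feature dim, partitions
# Vertex features live in (optionally pinned) CPU memory, not on the GPU.
feats = torch.randn(N, D, pin_memory=torch.cuda.is_available())
W1 = torch.randn(D, D, device=device, requires_grad=True)
W2 = torch.randn(D, D, device=device, requires_grad=True)
psz = N // P
RECOMPUTE = {2, 3}   # illustrative: partitions whose layer-1 output is recomputed

def layer1(x):
    # Stand-in for a GNN layer (a real system would also aggregate neighbors).
    return torch.relu(x @ W1)

opt = torch.optim.SGD([W1, W2], lr=1e-2)
for pid in range(P):
    lo, hi = pid * psz, (pid + 1) * psz
    x = feats[lo:hi].to(device, non_blocking=True)    # host -> GPU slice
    if pid in RECOMPUTE:
        # Recomputation: the layer-1 activation is not stored; checkpoint
        # re-runs layer1 during backward, trading compute for GPU memory.
        h1 = checkpoint(layer1, x, use_reentrant=False)
    else:
        # Caching: autograd keeps the activation resident for backward.
        h1 = layer1(x)
    loss = torch.relu(h1 @ W2).pow(2).mean()          # toy per-partition loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Under these assumptions, which partitions recompute versus cache is the memory/compute knob; HongTu's hybrid management and its cost-model-guided reorganization decide such trade-offs systematically rather than via a fixed set.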
- TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016. 265–283.
- AMD. 2022. AMD Infinity Architecture. https://www.amd.com/en/technologies/infinity-architecture.
- Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph Framework I: Compression Techniques. In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004). ACM Press, Manhattan, USA, 595–601.
- DGCL: an efficient communication library for distributed GNN training. In EuroSys ’21: Sixteenth European Conference on Computer Systems, Online Event, United Kingdom, April 26-28, 2021, Antonio Barbalace, Pramod Bhatotia, Lorenzo Alvisi, and Cristian Cadar (Eds.). ACM, 130–144.
- Training Deep Nets with Sublinear Memory Cost. CoRR abs/1604.06174 (2016). arXiv:1604.06174 http://arxiv.org/abs/1604.06174
- Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States. 1232–1240.
- Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (Eds.). 3837–3845.
- DGL 2020. Deep Graph Library: towards efficient and scalable deep learning on graphs. https://www.dgl.ai/.
- Euler 2019. Euler. https://github.com/alibaba/euler/wiki/System-Introduction.
- Matthias Fey and Jan Eric Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. CoRR abs/1903.02428 (2019). arXiv:1903.02428 http://arxiv.org/abs/1903.02428
- Pipelined Query Processing in Coprocessor Environments. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 1603–1618.
- LargeEA: Aligning Entities for Large-scale Knowledge Graphs. Proc. VLDB Endow. 15, 2 (2022), 237–245.
- Graphite: optimizing graph neural networks on CPUs through cooperative software-hardware techniques. In ISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 916–931.
- Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 1024–1034. http://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs
- Open Graph Benchmark: Datasets for Machine Learning on Graphs. CoRR abs/2005.00687 (2020). https://doi.org/10.48550/ARXIV.2005.00687
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. 103–112.
- Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020.
- Chapter 4: The traveling salesman problem. In Network Models. Handbooks in Operations Research and Management Science, Vol. 7. Elsevier, 225–330. https://doi.org/10.1016/S0927-0507(05)80121-5
- George Karypis and Vipin Kumar. 1996. Parallel Multilevel Graph Partitioning. In Proceedings of IPPS ’96, The 10th International Parallel Processing Symposium, April 15-19, 1996, Honolulu, Hawaii, USA. 314–319.
- Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- Jure Leskovec and Rok Sosič. 2016. SNAP: A General-Purpose Network Analysis and Graph-Mining Library. ACM Transactions on Intelligent Systems and Technology (TIST) 8, 1 (2016), 1.
- HippogriffDB: Balancing I/O and GPU Bandwidth in Big Data Analytics. Proc. VLDB Endow. 9, 14 (2016), 1647–1658.
- Gated Graph Sequence Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
- PaGraph: Scaling GNN training on large graphs via computation-aware caching. In SoCC ’20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 19-21, 2020, Rodrigo Fonseca, Christina Delimitrou, and Beng Chin Ooi (Eds.). ACM, 401–415.
- Hyperbolic Graph Neural Networks. CoRR abs/1910.12892 (2019).
- Unsupervised Entity Alignment for Temporal Knowledge Graphs. In WWW. 2528–2538.
- NeuGraph: Parallel Deep Neural Network Computation on Large Graphs. In 2019 USENIX Annual Technical Conference, USENIX ATC 2019, Renton, WA, USA, July 10-12, 2019. 443–458.
- Diego Marcheggiani and Ivan Titov. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, Martha Palmer, Rebecca Hwa, and Sebastian Riedel (Eds.). Association for Computational Linguistics, 1506–1515.
- DistGNN: scalable distributed training for large-scale graph neural networks. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, Bronis R. de Supinski, Mary W. Hall, and Todd Gamblin (Eds.). ACM, 76.
- Microsoft. 2020. Extreme-scale model training for everyone. https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone.
- EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal In GPUs. Proc. VLDB Endow. 14, 2 (2020), 114–127.
- Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture. Proc. VLDB Endow. 14, 11 (2021), 2087–2100.
- PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27-30, 2019. 1–15.
- Memory-Efficient Pipeline-Parallel DNN Training. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 7937–7947.
- NVIDIA. 2022a. NVIDIA A100 TENSOR CORE GPU. https://www.nvidia.com/en-us/data-center/a100/.
- NVIDIA. 2022b. NVIDIA Collective Communication Library. https://developer.nvidia.com/nccl.
- NVIDIA. 2022c. NVLink and NVSwitch. https://www.nvidia.com/en-sg/data-center/nvlink/.
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. 8024–8035.
- SANCUS: Staleness-Aware Communication-Avoiding Full-Graph Decentralized Training in Large-Scale Graph Neural Networks. Proc. VLDB Endow. 15, 9 (2022), 1937–1950.
- PyTorch 2020. Tensors and Dynamic neural networks in Python with strong GPU acceleration. https://pytorch.org/.
- ZeRO: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, Christine Cuicchi, Irene Qualters, and William T. Kramer (Eds.). IEEE/ACM, 20.
- ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, Bronis R. de Supinski, Mary W. Hall, and Todd Gamblin (Eds.). ACM, 59.
- ZeRO-Offload: Democratizing Billion-Scale Model Training. In 2021 USENIX Annual Technical Conference, USENIX ATC 2021, July 14-16, 2021, Irina Calciu and Geoff Kuenning (Eds.). USENIX Association, 551–564.
- Subway: minimizing data transfer during out-of-GPU-memory graph processing. In EuroSys ’20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, April 27-30, 2020, Angelos Bilas, Kostas Magoutis, Evangelos P. Markatos, Dejan Kostic, and Margo I. Seltzer (Eds.). ACM, 12:1–12:16.
- GraphReduce: processing large-scale graphs on accelerator-based systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 15-20, 2015, Jackie Kern and Jeffrey S. Vetter (Eds.). ACM, 28:1–28:12.
- Learning Multiagent Communication with Backpropagation. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 2244–2252.
- Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021, July 14-16, 2021. 495–514.
- Reducing communication in graph neural network training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020. 70.
- Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
- PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019, Dresden, Germany, March 25-28, 2019, George Candea, Robbert van Renesse, and Christof Fetzer (Eds.). ACM, 26:1–26:17.
- Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. CoRR abs/1909.01315 (2019). arXiv:1909.01315 http://arxiv.org/abs/1909.01315
- NeutronStar: Distributed GNN Training with Hybrid Dependency Management. In SIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022. ACM, 1301–1315.
- Graph Neural Networks in Recommender Systems: A Survey. CoRR abs/2011.02260 (2020).
- A Comprehensive Survey on Graph Neural Networks. CoRR abs/1901.00596 (2019). arXiv:1901.00596 http://arxiv.org/abs/1901.00596
- How Powerful are Graph Neural Networks? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https://openreview.net/forum?id=ryGs6iA5Km
- WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, Los Alamitos, CA, USA, 767–780.
- GNNLab: a factored system for sample-based GNN training over GPUs. In EuroSys ’22: Seventeenth European Conference on Computer Systems, Rennes, France, April 5 - 8, 2022. ACM, 417–434.
- DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs. CoRR abs/2010.05337 (2020).
- Scaph: Scalable GPU-Accelerated Graph Processing with Value-Driven Differential Scheduling. In 2020 USENIX Annual Technical Conference, USENIX ATC 2020, July 15-17, 2020, Ada Gavrilovska and Erez Zadok (Eds.). USENIX Association, 573–588.
- AliGraph: A Comprehensive Graph Neural Network Platform. Proc. VLDB Endow. 12, 12 (2019), 2094–2105. https://doi.org/10.14778/3352063.3352127
- Gemini: A Computation-Centric Distributed Graph Processing System. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016. 301–316.