T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives (2401.16677v1)
Abstract: LLMs increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space, requiring only minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: 29% geomean for sublayers in $\sim$500-billion parameter models, PaLM and MT-NLG.
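T3's track-and-trigger mechanism lives in hardware, and the abstract notes that achieving this interleaving purely in software is difficult. For intuition only, below is a minimal software analogue (not the paper's mechanism) of fine-grained overlap in tensor-parallel execution: each rank holds a shard of the weight along the reduction dimension, so its local GEMM produces partial sums that an all-reduce must combine. Chunking the producer GEMM and issuing an asynchronous all-reduce per chunk lets chunk i's communication overlap chunk i+1's compute. The function name and chunk count are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def chunked_gemm_allreduce(x, w, num_chunks=4):
    """Sketch of fine-grained compute/communication overlap in software.

    x: [M, K_local] local activation shard; w: [K_local, N] local weight
    shard. Each rank's matmul yields a partial product over the full
    [M, N] output, which the all-reduce sums across ranks (row-parallel TP).
    """
    out = torch.empty(x.shape[0], w.shape[1], device=x.device, dtype=x.dtype)
    handles = []
    # Split the producer GEMM along output rows; start an async all-reduce
    # on each chunk as soon as it is produced, so communication of chunk i
    # overlaps computation of chunk i+1.
    for xc, oc in zip(x.chunk(num_chunks, dim=0), out.chunk(num_chunks, dim=0)):
        torch.matmul(xc, w, out=oc)                          # produce one chunk
        handles.append(dist.all_reduce(oc, async_op=True))   # communicate it
    for h in handles:                                        # drain collectives
        h.wait()
    return out
```

Running this requires `dist.init_process_group` with an appropriate backend (e.g., NCCL or RCCL) and one process per GPU. The chunk count trades overlap granularity against per-collective launch overhead, and the contention between the GEMM and the collective for compute and memory bandwidth is exactly the inefficiency that T3's hardware support is designed to avoid.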
- AMD. 2018. AMD’s ROCm Communication Collectives Library. https://github.com/ROCmSoftwarePlatform/rccl/wiki.
- AMD. 2019. AMD’s BLAS Library. https://github.com/ROCmSoftwarePlatform/rocBLAS.
- AMD. 2020. AMD’s tool for creating a benchmark-driven backend library for GEMMs. https://github.com/ROCmSoftwarePlatform/Tensile/.
- AMD. 2021. AMD HSA Code Object Format. https://rocmdocs.amd.com/en/latest/ROCm_Compiler_SDK/ROCm-Codeobj-format.html.
- AMD. 2022. AMD INSTINCT™ MI210 ACCELERATOR. https://www.amd.com/en/products/server-accelerators/amd-instinct-mi210.
- DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, IEEE Press, Piscataway, NJ, USA, 1–15.
- NaviSim: A Highly Accurate GPU Simulator for AMD RDNA GPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Chicago, Illinois) (PACT ’22). Association for Computing Machinery, New York, NY, USA, 333–345. https://doi.org/10.1145/3559009.3569666
- Nathan Benaich and Ian Hogarth. 2022. State of AI Report 2022. https://www.stateof.ai/.
- Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS, Vol. 33), H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.). Curran Associates Inc., Red Hook, NY, USA, 1877–1901.
- Synthesizing Optimal Collective Algorithms. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP). Association for Computing Machinery, New York, NY, USA, 62–75. https://doi.org/10.1145/3437801.3441620
- Architecting an Energy-Efficient DRAM System for GPUs. In 23rd IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, IEEE Computer Society, Washington, DC, USA, 73–84.
- PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311 (2022), 87 pages.
- FlashAttention: Fast and Memory-efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Morristown, NJ, USA, 4171–4186. https://doi.org/10.18653/v1/n19-1423
- Ashraf Eassa and Sukru Burc Eryilmaz. 2022. The Full Stack Optimization Powering NVIDIA MLPerf Training v2.0 Performance. https://developer.nvidia.com/blog/boosting-mlperf-training-performance-with-full-stack-optimization/.
- KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, IEEE Press, Piscataway, NJ, USA, 1–12. https://doi.org/10.1109/MICRO.2016.7783716
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. The Journal of Machine Learning Research 23, 1, Article 120 (jan 2022), 39 pages.
- Automatic Fusions of CUDA-GPU Kernels for Parallel Map. SIGARCH Comput. Archit. News 39, 4 (Dec. 2011), 98–99. https://doi.org/10.1145/2082156.2082183
- Amir Gholami. 2021. AI and Memory Wall.
- Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level. In 24th IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 608–619. https://doi.org/10.1109/HPCA.2018.00058
- Achieving Human Parity on Automatic Chinese to English News Translation. arXiv preprint arXiv:1803.05567 (March 2018), 25 pages. arXiv:1803.05567 [cs.CL]
- Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015), 12 pages. arXiv:1512.03385 http://arxiv.org/abs/1512.03385
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS, Vol. 32). Curran Associates Inc., Red Hook, NY, USA, Article 10, 10 pages.
- ARK: GPU-driven Code Execution for Distributed Deep Learning. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX Association, Boston, MA, 87–101. https://www.usenix.org/conference/nsdi23/presentation/hwang
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Association for Computing Machinery, New York, NY, USA, 402–416. https://doi.org/10.1145/3503222.3507778
- Sylvain Jeaugey. 2022. How is tree reduction implemented? https://github.com/NVIDIA/nccl/issues/545#issuecomment-1006361565.
- Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications. In Proceedings of Workshop on General Purpose Processing using GPUs (GPGPU). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/2588768.2576780
- Exploiting Core Criticality for Enhanced GPU Performance. In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science. Association for Computing Machinery, New York, NY, USA, 351–363. https://doi.org/10.1145/2896377.2901468
- Ten Lessons from Three Generations Shaped Google’s TPUv4i. In Proceedings of the 48th Annual International Symposium on Computer Architecture (Virtual Event, Spain) (ISCA). IEEE Press, Piscataway, NJ, USA, 1–14. https://doi.org/10.1109/ISCA52012.2021.00010
- cuTLASS: Fast linear algebra in CUDA C++.
- Exploring Modern GPU Memory System Design Challenges through Accurate Modeling. CoRR abs/1810.07269 (2018), 10 pages. arXiv:1810.07269 http://arxiv.org/abs/1810.07269
- Locality-Centric Data and Threadblock Management for Massive GPUs. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, Los Alamitos, CA, USA, 1022–1036. https://doi.org/10.1109/MICRO50266.2020.00086
- Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE Press, Piscataway, NJ, USA, 473–486. https://doi.org/10.1109/ISCA45697.2020.00047
- GradPIM: A Practical Processing-in-DRAM Architecture for Gradient Descent. In 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, Washington, DC, USA, 14 pages.
- Scalable and Efficient MoE Training for Multitask Multilingual Models. https://doi.org/10.48550/ARXIV.2109.10465
- An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives. In ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, IEEE Computer Society, Washington, DC, USA, 996–1009.
- ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (Lake Tahoe, Nevada) (NIPS’12). Curran Associates Inc., USA, 1097–1105. http://dl.acm.org/citation.cfm?id=2999134.2999257
- Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product. In ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE Press, Piscataway, NJ, USA, 43–56. https://doi.org/10.1109/ISCA52012.2021.00013
- Analyzing Machine Learning Workloads Using a Detailed GPU Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, IEEE Computer Society, Washington, DC, USA, 151–152.
- Network In Network. In 2nd International Conference on Learning Representations (ICLR), Yoshua Bengio and Yann LeCun (Eds.). OpenReview.net, 10 pages. http://arxiv.org/abs/1312.4400
- The Architectural Implications of Autonomous Driving: Constraints and Acceleration. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (Williamsburg, VA, USA) (ASPLOS). ACM, New York, NY, USA, 751–766. https://doi.org/10.1145/3173162.3173191
- Daniel Lustig and Margaret Martonosi. 2013. Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization. In Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA). IEEE Computer Society, USA, 354–365. https://doi.org/10.1109/HPCA.2013.6522332
- Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX Association, Boston, MA, 809–824. https://www.usenix.org/conference/nsdi23/presentation/mahajan
- MLPerf Training Benchmark. CoRR abs/1910.01500 (2019), 14 pages. arXiv:1910.01500 http://arxiv.org/abs/1910.01500
- Mixed Precision Training. arXiv:1710.03740 [cs.AI] http://arxiv.org/abs/1710.03740
- FP8 Formats for Deep Learning. CoRR abs/2209.05433 (2022), 9 pages. https://doi.org/10.48550/ARXIV.2209.05433 arXiv:2209.05433
- Microsoft. 2020. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Research Blog 1, 8 (2020), 8 pages. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
- MLPerf. 2018. MLPerf Benchmark Suite. https://mlperf.org/.
- AMPeD: An Analytical Model for Performance in Distributed Training of Transformers. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE Computer Society, Los Alamitos, CA, USA, 306–315. https://doi.org/10.1109/ISPASS57527.2023.00037
- GPS: A Global Publish-Subscribe Model for Multi-GPU Memory Management. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 46–58. https://doi.org/10.1145/3466752.3480088
- Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-grained Transfers. In ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, IEEE Computer Society, Washington, DC, USA, 139–152.
- GraphPIM: Enabling Instruction-level PIM Offloading in Graph Computing Frameworks. In IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, IEEE Computer Society, Los Alamitos, CA, USA, 457–468. https://doi.org/10.1109/HPCA.2017.54
- NVIDIA. 2017. NVIDIA DGX-1 With Tesla V100 System Architecture. https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf.
- NVIDIA. 2018. NVIDIA TESLA V100 GPU ACCELERATOR. https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf.
- NVIDIA. 2020. NVIDIA NCCL.
- NVIDIA. 2021. NVIDIA A100 TENSOR CORE GPU. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf.
- NVIDIA. 2022. GPUDirect. https://developer.nvidia.com/gpudirect.
- NVIDIA. 2023a. Efficient GEMM in CUDA. https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md#parallelized-reductions.
- NVIDIA. 2023b. NVIDIA Announces DGX GH200 AI Supercomputer. https://nvidianews.nvidia.com/news/nvidia-grace-hopper-superchips-designed-for-accelerated-generative-ai-enter-full-production.
- NVIDIA. 2023c. NVIDIA H100 TENSOR CORE GPU. https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.
- NVIDIA Corp. 2016. NVIDIA cuBLAS. https://developer.nvidia.com/cublas. Accessed August 6, 2016.
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting. arXiv preprint arXiv:2311.18677 (2023), 12 pages. arXiv:2311.18677 [cs.AR]
- Demystifying BERT: System Design Implications. In 2022 IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, Los Alamitos, CA, USA, 296–309. https://doi.org/10.1109/IISWC55918.2022.00033
- Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware. In IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, Los Alamitos, CA, USA, 140–153. https://doi.org/10.1109/IISWC59245.2023.00026
- Opportunistic Computing in GPU Architectures. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA). Association for Computing Machinery, New York, NY, USA, 210–223. https://doi.org/10.1145/3307650.3322212
- J Thomas Pawlowski. 2011. Hybrid Memory Cube (HMC). In 2011 IEEE Hot Chips 23 Symposium (HotChips). IEEE, IEEE, Piscataway, NJ, USA, 1–24.
- GPU-initiated Fine-grained Overlap of Collective Communication with Computation. arXiv preprint arXiv:2305.06942 (2023), 13 pages. arXiv:2305.06942 [cs.DC]
- Language Models are Unsupervised Multitask Learners. OpenAI Blog 1, 8 (2019), 9 pages.
- DeepSpeed-MOE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. In International Conference on Machine Learning (ICML). PMLR, PMLR, 18332–18346.
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC ’21). Association for Computing Machinery, New York, NY, USA, Article 59, 14 pages. https://doi.org/10.1145/3458817.3476205
- Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, IEEE Press, Piscataway, NJ, USA, 540–553. https://doi.org/10.1109/ISCA52012.2021.00049
- MLPerf Inference Benchmark. In ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE Press, Washington, DC, USA, 446–459. https://doi.org/10.1109/ISCA45697.2020.00045
- A Generalist Agent. Transactions on Machine Learning Research 2022 (2022), 42 pages. https://openreview.net/forum?id=1ikK0kHjvj
- Kyle Roarty and Matthew D. Sinclair. 2020. Modeling Modern GPU Applications in gem5. In 3rd gem5 Users’ Workshop. 2 pages.
- Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NeurIPS’20). Curran Associates Inc., Red Hook, NY, USA, Article 861, 11 pages.
- Aarush Selvan and Pankaj Kanwar. 2022. Google showcases Cloud TPU v4 Pods for large model training. https://cloud.google.com/blog/topics/tpus/google-showcases-cloud-tpu-v4-pods-for-large-model-training.
- TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX Association, Boston, MA, 593–612. https://www.usenix.org/conference/nsdi23/presentation/shah
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs/1909.08053 (2019), 9 pages. arXiv:1909.08053 [cs.CL] http://arxiv.org/abs/1909.08053
- Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations (ICLR), Yoshua Bengio and Yann LeCun (Eds.). OpenReview.net, 14 pages. http://arxiv.org/abs/1409.1556
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530b, a Large-scale Generative Language Model. arXiv preprint arXiv:2201.11990 (2022), 44 pages. arXiv:2201.11990 [cs.CL]
- Modular Array-Based GPU Computing in a Dynamically-Typed Language. In Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (Barcelona, Spain) (ARRAY 2017). Association for Computing Machinery, New York, NY, USA, 48–55. https://doi.org/10.1145/3091966.3091974
- Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, Piscataway, NJ, USA, 1–9.
- Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, Piscataway, NJ, USA, 2818–2826. https://doi.org/10.1109/CVPR.2016.308
- Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NeurIPS). Curran Associates, Inc., Red Hook, NY, USA, 6000–6010. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
- Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Proceedings of the 2010 IEEE/ACM Int’l Conference on Green Computing and Communications & Int’l Conference on Cyber, Physical and Social Computing (GREENCOM-CPSCOM ’10). IEEE Computer Society, USA, 344–350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS). Association for Computing Machinery, New York, NY, USA, 93–106. https://doi.org/10.1145/3567955.3567959
- Toward Human Parity in Conversational Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (Sept 2017), 2410–2423.
- Suchita Pati
- Shaizeen Aga
- Mahzabeen Islam
- Nuwan Jayasena
- Matthew D. Sinclair