A Framework for Fine-Grained Synchronization of Dependent GPU Kernels (2305.13450v3)
Abstract: Machine Learning (ML) models execute several parallel computations, including Generalized Matrix Multiplication, Convolution, and Dropout. These computations are commonly executed on Graphics Processing Units (GPUs) by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles is usually larger than the number of execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models over several batch sizes: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16%.
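To make the tile-level idea concrete, below is a minimal CUDA sketch of per-tile producer/consumer synchronization through a global-memory semaphore array. This is not the cuSync API; all names (`producer`, `consumer`, `tileReady`, `TILE`, `NTILES`) are illustrative assumptions. Each consumer block spin-waits only on the one tile it depends on, so independent tiles of the two kernels can overlap. The sketch also assumes thread blocks of both kernels can be resident on the GPU concurrently; a real framework must enforce this (and the wait policy) more carefully.

```cuda
// Minimal sketch: tile-level producer/consumer synchronization between two
// dependent kernels launched on separate streams. All names are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE   256   // threads per block == elements per tile (assumption)
#define NTILES 64    // number of tiles (assumption)

__global__ void producer(float* buf, unsigned int* tileReady) {
    int tile = blockIdx.x;
    int i = tile * TILE + threadIdx.x;
    buf[i] = 2.0f * i;                     // compute one tile of the output
    __syncthreads();                       // the whole tile is now written
    __threadfence();                       // make tile writes globally visible
    if (threadIdx.x == 0)
        atomicExch(&tileReady[tile], 1u);  // release: publish this tile only
}

__global__ void consumer(const float* buf, unsigned int* tileReady, float* out) {
    int tile = blockIdx.x;
    if (threadIdx.x == 0) {
        // Acquire: spin on this tile's semaphore, not on the whole kernel.
        while (atomicAdd(&tileReady[tile], 0u) == 0u) { }
        __threadfence();                   // order subsequent reads after the wait
    }
    __syncthreads();                       // all threads see the acquired tile
    int i = tile * TILE + threadIdx.x;
    out[i] = buf[i] + 1.0f;                // consume the dependent tile
}

int main() {
    float *buf, *out;
    unsigned int* tileReady;
    cudaMalloc(&buf, NTILES * TILE * sizeof(float));
    cudaMalloc(&out, NTILES * TILE * sizeof(float));
    cudaMalloc(&tileReady, NTILES * sizeof(unsigned int));
    cudaMemset(tileReady, 0, NTILES * sizeof(unsigned int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // No stream synchronization between launches: the per-tile semaphores
    // order the kernels, so their independent tiles may run concurrently.
    producer<<<NTILES, TILE, 0, s1>>>(buf, tileReady);
    consumer<<<NTILES, TILE, 0, s2>>>(buf, tileReady, out);
    cudaDeviceSynchronize();

    float h;
    cudaMemcpy(&h, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h);            // expect 1.0
    return 0;
}
```

The spin-wait above is only the simplest possible wait policy; the paper's point is that the synchronization policy (which tiles wait on which, and how) is user-defined and compiler-generated rather than fixed.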
- NVIDIA CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass, Accessed: 2023-07-30.
- GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11, pages 893–905, USA, 2011. IEEE Computer Society.
- Improving the Scalability of GPU Synchronization Primitives. IEEE Transactions on Parallel and Distributed Systems, 34(1):275–290, 2023.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, 2022.
- Warp Scheduling for Fine-Grained Synchronization. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 375–388, 2018.
- Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385, 2015.
- Abhinav Jangda. (Artifact) A Framework for Fine-Grained Synchronization of Dependent GPU Kernels. December 2023.
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22, pages 402–416, New York, NY, USA, 2022. Association for Computing Machinery.
- Fine-Grained Synchronizations and Dataflow Programming on GPUs. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, pages 109–118, New York, NY, USA, 2015. Association for Computing Machinery.
- Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’23, pages 429–431, New York, NY, USA, 2023. Association for Computing Machinery.
- Noam Shazeer. GLU Variants Improve Transformer, 2020.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2020.
- Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015.
- HeteroSync: A benchmark suite for fine-grained synchronization on tightly coupled GPUs. In 2017 IEEE International Symposium on Workload Characterization (IISWC), pages 239–249, 2017.
- LLaMA: Open and Efficient Foundation Language Models, 2023.
- Fast Fine-Grained Global Synchronization on GPUs. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, pages 793–806, New York, NY, USA, 2019. Association for Computing Machinery.
- Lock-Based Synchronization for GPU Architectures. In Proceedings of the ACM International Conference on Computing Frontiers, CF ’16, pages 205–213, New York, NY, USA, 2016. Association for Computing Machinery.
- HQL: A Scalable Synchronization Mechanism for GPUs. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS ’13, pages 475–486, USA, 2013. IEEE Computer Society.
- A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 483–493, 2020.