SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation (2403.16863v1)
Abstract: LLMs have become a significant workload since their emergence. However, they are also computationally expensive, with billions of parameters trained on massive amounts of data. Recent works have therefore developed dedicated CUDA kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible. In this work, we explore the possibility of GPU native instruction optimization to push CUDA kernels to extreme performance. In contrast to prior work, we adopt an automatic optimization approach: we define a search space of possible GPU native instruction schedules and then apply stochastic search over it. Experiments show that SIP can further improve CUDA kernel throughput by automatically discovering better GPU native instruction schedules, and the optimized schedules are validated with 10 million test samples.
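The stochastic search the abstract describes can be illustrated with a minimal sketch. This is a hypothetical toy, not SIP's implementation: the cost function here is a stand-in for measured kernel latency, and a real autotuner would only accept perturbations that preserve data dependencies and pass correctness tests, which this sketch omits.

```python
import random

def perturb(schedule):
    """One primitive mutation: swap two adjacent instructions.

    A real system would first verify the swap respects data
    dependencies and memory barriers; this sketch skips that check.
    """
    s = list(schedule)
    i = random.randrange(len(s) - 1)
    s[i], s[i + 1] = s[i + 1], s[i]
    return s

def stochastic_search(schedule, cost, iters=1000, seed=0):
    """Hill-climbing variant of stochastic search: keep a perturbed
    schedule only if the cost improves. In a real autotuner, `cost`
    would be the measured latency of the reassembled kernel."""
    random.seed(seed)
    best, best_cost = list(schedule), cost(schedule)
    for _ in range(iters):
        candidate = perturb(best)
        c = cost(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best, best_cost

# Toy cost model (an assumption for illustration): issuing global loads
# ("LDG") earlier hides memory latency, so schedules with loads nearer
# the front score lower.
def toy_cost(schedule):
    return sum(i for i, op in enumerate(schedule) if op.startswith("LDG"))

sched = ["FFMA", "FFMA", "LDG.E", "IADD", "LDG.E", "FFMA"]
best, best_cost = stochastic_search(sched, toy_cost)
```

The search only ever accepts improving schedules, so the result is never worse than the starting point, and every candidate remains a permutation of the original instruction sequence.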