SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation (2403.16863v1)

Published 25 Mar 2024 in cs.AR and cs.AI

Abstract: LLMs have become a significant workload since their emergence. However, they are also computationally expensive, as they have billions of parameters and are trained on massive amounts of data. Recent works have therefore developed dedicated CUDA kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible. In this work, we explore the possibility of GPU native instruction optimization to push CUDA kernels further toward peak performance. In contrast to prior works, we adopt an automatic optimization approach: we define a search space of possible GPU native instruction schedules and then apply stochastic search to perform the optimization. Experiments show that SIP can further improve CUDA kernel throughput by automatically discovering better GPU native instruction schedules, and the optimized schedules are verified with 10 million test samples.
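The core loop the abstract describes — define a search space of instruction schedules, then stochastically perturb and re-benchmark — can be sketched as follows. This is a minimal illustration in Python, not the paper's implementation; `mutate`, `benchmark`, and `is_valid` are hypothetical placeholders for SIP's actual SASS-level perturbation, kernel timing, and correctness checking.

```python
# Minimal sketch of stochastic instruction-schedule search, assuming the
# caller supplies a benchmark function (measured kernel latency, lower is
# better) and a validity check (data dependencies still respected). The
# real system operates on GPU-native SASS via an assembler; here a
# "schedule" is just a Python list of instructions.
import copy
import random


def mutate(schedule):
    """One example perturbation move: swap two randomly chosen instructions."""
    s = copy.copy(schedule)
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s


def stochastic_search(schedule, benchmark, is_valid, iterations=1000):
    """Hill-climb over instruction orderings, keeping the fastest valid one."""
    best, best_cost = schedule, benchmark(schedule)
    for _ in range(iterations):
        candidate = mutate(best)
        if not is_valid(candidate):   # reject schedules that break dependencies
            continue
        cost = benchmark(candidate)
        if cost < best_cost:          # greedy accept; SIP's policy may differ
            best, best_cost = candidate, cost
    return best
```

In the paper's setting, the final correctness check is much heavier than a per-candidate validity test: the optimized schedules are verified against 10 million test samples before being accepted.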

