ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code (2306.13002v4)

Published 22 Jun 2023 in cs.DC

Abstract: Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they require repetitive implementation to perform similar analyses due to the lack of cooperation. To address this issue, modern optimization techniques, such as equality saturation, allow for exhaustive term rewriting at various levels of inputs, thereby simplifying compiler design. In this paper, we propose equality saturation to optimize sequential codes utilized in directive-based programming for GPUs. Our approach realizes less computation, less memory access, and high memory throughput simultaneously. Our fully-automated framework constructs single-assignment forms from inputs to be entirely rewritten while keeping dependencies and extracts optimal cases. Through practical benchmarks, we demonstrate a significant performance improvement on several compilers. Furthermore, we highlight the advantages of computational reordering and emphasize the significance of memory-access order for modern GPUs.
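
The idea of exhaustive term rewriting followed by extraction of the cheapest equivalent form can be illustrated with a toy sketch. This is not the paper's framework and it does not use an e-graph: it simply enumerates a set of equivalent expression trees for a small arithmetic term and picks the one with the fewest operations. The rule set, the cost function, and all names (`rewrites`, `cost`, `saturate_and_extract`) are hypothetical choices made for illustration only.

```python
# Toy illustration of "rewrite exhaustively, then extract the cheapest term".
# Assumption: terms are nested tuples (op, left, right); leaves are variable
# names (str) or constants (int). A real equality-saturation engine would use
# an e-graph instead of an explicit set of whole terms.

def rewrites_at_root(t):
    """Yield terms equal to t obtained by applying one rule at the root."""
    if not isinstance(t, tuple):
        return
    op, x, y = t
    if op in ("+", "*"):
        yield (op, y, x)                      # commutativity
    if op == "*" and y == 1:
        yield x                               # x * 1 -> x
    if op == "+" and y == 0:
        yield x                               # x + 0 -> x
    if (op == "+" and isinstance(x, tuple) and isinstance(y, tuple)
            and x[0] == "*" and y[0] == "*" and x[1] == y[1]):
        yield ("*", x[1], ("+", x[2], y[2]))  # a*b + a*c -> a*(b+c)

def rewrites(t):
    """Yield every term reachable by one rule application anywhere in t."""
    yield from rewrites_at_root(t)
    if isinstance(t, tuple):
        op, x, y = t
        for x2 in rewrites(x):
            yield (op, x2, y)
        for y2 in rewrites(y):
            yield (op, x, y2)

def cost(t):
    """Hypothetical cost model: count arithmetic operations in the term."""
    return 0 if not isinstance(t, tuple) else 1 + cost(t[1]) + cost(t[2])

def saturate_and_extract(term, max_terms=10000):
    """Grow the set of equivalent terms to a fixpoint, then pick the cheapest."""
    seen = {term}
    frontier = [term]
    while frontier and len(seen) < max_terms:
        next_frontier = []
        for t in frontier:
            for u in rewrites(t):
                if u not in seen:
                    seen.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return min(seen, key=cost)

term = ("+", ("*", "a", "b"), ("*", "a", "c"))   # a*b + a*c, three operations
best = saturate_and_extract(term)                # a factored form, two operations
```

Because no rewrite is applied destructively, the original term and every alternative stay available until extraction, which is the property that distinguishes this style from a fixed sequence of irreversible passes. The `max_terms` bound is a safeguard added here; with this small rule set the search reaches a fixpoint on its own.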
