ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code (2306.13002v4)
Abstract: Automatic code optimization is a complex process that typically involves applying multiple discrete algorithms that modify the program structure irreversibly. However, these algorithms are often designed monolithically, and because they do not cooperate, similar analyses must be implemented repeatedly. To address this issue, modern optimization techniques such as equality saturation allow exhaustive term rewriting at various levels of input, thereby simplifying compiler design. In this paper, we propose using equality saturation to optimize the sequential code used in directive-based programming for GPUs. Our approach simultaneously achieves less computation, fewer memory accesses, and high memory throughput. Our fully automated framework constructs single-assignment forms from the input, rewrites them exhaustively while preserving dependencies, and extracts the optimal variants. Through practical benchmarks, we demonstrate a significant performance improvement on several compilers. Furthermore, we highlight the advantages of computational reordering and emphasize the significance of memory-access order for modern GPUs.
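The idea of exhaustive rewriting followed by cost-based extraction can be illustrated with a minimal sketch. This is not the paper's framework: real equality saturation stores equivalent terms compactly in an e-graph, whereas the sketch below naively enumerates every equivalent term in a Python set. The two rewrite rules, the tuple encoding of terms, and the cost weights are all assumptions chosen for illustration.

```python
# Terms are nested tuples such as ("+", x, y) or ("*", x, y); leaves are strings.
# Naive stand-in for equality saturation: enumerate all rewrites to a fixed
# point, then pick the cheapest term. A real implementation would use an e-graph.

def rewrites_at_root(t):
    """Yield terms equal to t obtained by applying one rule at the root."""
    if not isinstance(t, tuple):
        return
    op, a, b = t
    if op in ("+", "*"):
        yield (op, b, a)  # commutativity
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # factoring: x*y + x*z  ->  x*(y + z)
        yield ("*", a[1], ("+", a[2], b[2]))

def one_step(t):
    """Return all terms reachable from t by one rewrite at any position."""
    out = set(rewrites_at_root(t))
    if isinstance(t, tuple):
        op, a, b = t
        out |= {(op, a2, b) for a2 in one_step(a)}
        out |= {(op, a, b2) for b2 in one_step(b)}
    return out

def saturate(t, limit=10_000):
    """Grow the set of equivalent terms until no new term appears."""
    seen, frontier = {t}, [t]
    while frontier and len(seen) < limit:  # limit is only a safety guard
        new_terms = []
        for u in frontier:
            for v in one_step(u):
                if v not in seen:
                    seen.add(v)
                    new_terms.append(v)
        frontier = new_terms
    return seen

def cost(t):
    """Hypothetical cost model: a multiply costs 2, an add costs 1."""
    if not isinstance(t, tuple):
        return 0
    op, a, b = t
    return {"+": 1, "*": 2}[op] + cost(a) + cost(b)

def extract(t):
    """Pick the cheapest equivalent term; ties are broken by repr for determinism."""
    return min(saturate(t), key=lambda u: (cost(u), repr(u)))

expr = ("+", ("*", "a", "b"), ("*", "a", "c"))  # a*b + a*c
best = extract(expr)
print(best, cost(best))
```

For the input `a*b + a*c`, the search finds the factored form `a*(b + c)`, which uses one multiplication instead of two under the assumed cost model. Because rewriting is non-destructive, the order in which rules fire does not affect the result, which is the property the abstract contrasts with irreversible, monolithic passes.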