Porting Batched Iterative Solvers onto Intel GPUs with SYCL (2308.08417v3)
Abstract: Batched linear solvers play a vital role in computational sciences, especially in plasma physics and combustion simulations. With the imminent deployment of the Aurora supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to extend these solvers to Intel GPU architectures. In this paper, we present our efforts in porting and optimizing the batched iterative solvers on Intel GPUs using the SYCL programming model. The new solvers achieve impressive performance on the Intel GPU Max 1550s (Ponte Vecchio GPUs), surpassing our previous CUDA implementation on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs. The batched solvers are ready for production use in real-world scientific applications through the Ginkgo library, complementing the performance portability of Ginkgo's batched functionality.