Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame (2401.13310v1)
Abstract: The world's largest particle accelerator, located at CERN, produces petabytes of data that must be analysed efficiently to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing facilities, it becomes crucial to efficiently support GPGPUs to take advantage of the available resources. SYCL allows for a single-source implementation, which enables support for different architectures. In this paper, we describe a CUDA implementation and the migration process to SYCL, focusing on a core high-energy physics operation in RDataFrame: histogramming. We detail the challenges that we faced when integrating SYCL into a large and complex code base. Furthermore, we perform an extensive comparative performance analysis of two SYCL compilers, AdaptiveCpp and DPC++, and the reference CUDA implementation. We highlight the performance bottlenecks that we encountered and the methodology used to detect them. Based on our findings, we provide actionable insights for developers of SYCL applications.