Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core (2308.07487v1)
Abstract: In atomistic spin dynamics simulations, the time cost of constructing the space- and time-displaced pair correlation function in real space increases quadratically with the number of spins $N$, leading to significant computational effort. The GEMM subroutine can be adopted to accelerate the calculation of the dynamical spin-spin correlation function, but simulating large spin systems ($>40000$ spins) on CPUs remains expensive. In this work, we perform the simulation on the graphics processing unit (GPU), a hardware solution widely used as an accelerator for scientific computing and deep learning. We show that GPUs can accelerate the simulation by up to 25-fold compared to multi-core CPUs when the GEMM subroutine is used on both. To hide memory latency, we fuse the element-wise operation into the GEMM kernel using $\mathtt{CUTLASS}$, which improves performance by 26% to 33% compared to an implementation based on $\mathtt{cuBLAS}$. Furthermore, we perform the on-the-fly calculation in the epilogue of the GEMM subroutine to avoid saving intermediate results in global memory, which makes large-scale atomistic spin dynamics simulation feasible and affordable.
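The fusion idea described in the abstract can be illustrated with a small CPU-side sketch in NumPy. This is not the authors' CUTLASS kernel. The function names, the 1D site positions, the wave number `q`, and the choice of a cosine phase factor as the element-wise operation are all assumptions made for illustration, since the abstract does not specify the operation. The unfused version stores the full $N \times N$ product before the element-wise pass. The fused version applies the element-wise step to each output tile as soon as it is computed and reduces it immediately, so the intermediate matrix is never materialized.

```python
import numpy as np


def correlation_unfused(spins_0, spins_t, positions, q):
    """Two-pass version: GEMM first, element-wise operation afterwards."""
    n = spins_0.shape[0]
    # GEMM: (N x 3) @ (3 x N) gives S_i(0) . S_j(t) for every pair (i, j).
    corr = spins_0 @ spins_t.T
    # Hypothetical element-wise step: phase factor from the pair displacement.
    phase = np.cos(q * (positions[:, None] - positions[None, :]))
    # Requires the full N x N intermediate to be held in memory.
    return (corr * phase).sum() / n


def correlation_fused(spins_0, spins_t, positions, q, tile=64):
    """Tiled version: element-wise step and reduction run per output tile."""
    n = spins_0.shape[0]
    total = 0.0
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            # "Main loop": product for one output tile only.
            acc = spins_0[i:i + tile] @ spins_t[j:j + tile].T
            # "Epilogue": compute the phase on the fly and reduce at once,
            # so the tile is discarded instead of being written back.
            dx = positions[i:i + tile, None] - positions[None, j:j + tile]
            total += (acc * np.cos(q * dx)).sum()
    return total / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_spins = 200
    s0 = rng.normal(size=(n_spins, 3))
    s0 /= np.linalg.norm(s0, axis=1, keepdims=True)  # unit-length spins
    st = rng.normal(size=(n_spins, 3))
    st /= np.linalg.norm(st, axis=1, keepdims=True)
    x = np.arange(n_spins, dtype=float)  # assumed 1D chain of sites
    a = correlation_unfused(s0, st, x, q=0.3)
    b = correlation_fused(s0, st, x, q=0.3)
    print(np.isclose(a, b))
```

Both functions return the same value up to floating-point rounding. The fused variant only ever holds a `tile` by `tile` block, which mirrors, under the stated assumptions, why an epilogue-side calculation avoids storing intermediate results in global memory on the GPU.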