GPU-Accelerated Pinocchio for Efficient Cosmological Modeling
- GPU-Accelerated Pinocchio offloads computation-intensive tasks to GPUs, significantly enhancing performance and energy efficiency in cosmological simulations.
- The implementation achieves up to 8× reductions in both runtime and energy consumption, yielding an aggregate efficiency improvement of 64× on suitable hardware platforms.
- Integration of OpenMP-target and CUDA/OpenCL ensures cross-vendor GPU compatibility, facilitating 'green-aware' scientific computing.
GPU-Accelerated Pinocchio is an optimized version of the PINOCCHIO code—originally designed for fast generation of dark matter halo catalogues in cosmological simulations—leveraging GPU hardware to achieve dramatic improvements in throughput and energy efficiency. The approach offloads the collapse-time computation kernel, central to third-order Lagrangian Perturbation Theory (LPT), to GPUs using portable OpenMP-target or native CUDA/OpenCL implementations. Benchmarks on AMD MI250X and NVIDIA A100 clusters demonstrate up to 8× reductions in time-to-solution and energy-to-solution, culminating in an aggregate efficiency improvement of 64× on suitable hardware. Incorporating precise energy profiling via the Power Measurement Toolkit (PMT), this methodology substantiates the feasibility of "green-aware" scientific computing.
1. Algorithmic Structure and GPU Offloading
PINOCCHIO consists of three primary algorithmic stages within grid-based LPT for dark matter halo catalogue generation:
- Linear Density Field Construction: Formation of the density contrast δ(x) on a regular mesh using Fourier-space techniques.
- Collapse-Time Kernel: Calculation of the “collapse time” for each mesh cell through third-order LPT equations; in this implementation, this is the sole component ported to GPU hardware. This operation is embarrassingly parallel, as each cell’s computation is independent.
- Halo/Filament Grouping: Collapsed points are aggregated into halos and filaments employing a friends‐of‐friends–style merger tree.
On the device, both δ and the collapse-time array are managed as contiguous 1D arrays of double-precision elements whose length equals the total cell count N3, facilitating efficient batched memory transfer. The GPU kernel (CUDA style) iterates over cells:
```cuda
// Collapse-time kernel: one thread per mesh cell (embarrassingly parallel).
__global__ void collapseKernel(
    const double* __restrict__ delta,
    double* __restrict__ tcoll,
    int N3)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N3) {
        double D1 = f1(delta[idx]);          // first-order LPT growth term
        double D2 = f2(delta[idx]);          // second-order LPT growth term
        tcoll[idx] = t0 / (D1 + D2 + ...);   // higher-order terms elided
    }
}
```
2. Portable Implementation and Data Management
GPU-Accelerated Pinocchio achieves portability using a dual strategy: OpenMP-target for cross-vendor (AMD/NVIDIA) deployment and native CUDA or HIP kernels for maximal per-device performance. Data layout is optimized to minimize unnecessary host–device bandwidth consumption, with only the essential input and output arrays transferred.
There is no persistent management of particle lists within the collapse-time kernel, and the embarrassingly parallel structure allows for asynchronous execution. Overlapping computation and memory operations are attainable using CUDA/HIP streams or asynchronous OpenMP tasks; however, these mechanisms are not elaborated in the present profiling study.
3. Performance Characterization and Scaling Behavior
Benchmarks are reported on both SETONIX (AMD MI250X) and KAROLINA (NVIDIA A100) clusters. Two scaling regimes are analyzed:
- Strong Scaling: Fixed total grid size, with compute units (CUs) scaled from 1 to 16.
- Weak Scaling: Constant workload per CU, with the total grid size growing proportionally as CUs increase from 1 to 8.
Representative results (SETONIX, strong scaling):
| # CUs | T_CPU [s] | T_GPU [s] | Speedup T_CPU/T_GPU | E_CPU [kJ] | E_GPU [kJ] | Speedup E_CPU/E_GPU |
|---|---|---|---|---|---|---|
| 1 | 1200 | 155 | 7.7× | 360 | 45 | 8.0× |
| 8 | 150 | 20 | 7.5× | 360 | 45 | 8.0× |
| 16 | 75 | 10 | 7.5× | 360 | 45 | 8.0× |
For KAROLINA, the analogous speedup was approximately 2× for both runtime and energy consumption. Near-ideal strong scaling (time-to-solution decreasing in inverse proportion to CU count) is observed for the collapse kernel, subject to hardware interconnect limitations: on KAROLINA, PCIe bandwidth bottlenecks appear beyond 12 CUs.
4. Power Profiling Methodology
Performance and energy profiling is anchored by the Power Measurement Toolkit (PMT), extended here to operate in fully parallel MPI settings. Energy-to-solution is defined via

$$E_\mathrm{sol} = \int_{t_\mathrm{start}}^{t_\mathrm{end}} P(t)\,\mathrm{d}t,$$

where $P(t)$ denotes instantaneous power. CPU energy readings employ RAPL counters; GPU readings utilize NVML (NVIDIA) and ROCm SMI (AMD), achieving approximately 1% measurement accuracy. The profiling workflow comprises:
- PMT instance creation per MPI rank: `PMT_CREATE(comm, rapl_on, devID, numGPUs)`
- Bracketing of GPU computation: `PMT_CPU_START/STOP`, `PMT_GPU_START/STOP`
- Distributed counter aggregation and per-node/per-device results reporting.
This suggests a robust framework for reproducible, quantitative energy benchmarking in heterogeneous HPC environments.
5. Energy Efficiency Metrics and Comparative Analysis
Several metrics are employed to characterize computational sustainability:
- Energy-to-Solution: $E_\mathrm{sol} = \int P(t)\,\mathrm{d}t$
- Energy–Delay Product (EDP): $\mathrm{EDP} = E_\mathrm{sol} \times T_\mathrm{sol}$, with $T_\mathrm{sol}$ the time-to-solution
- Green Productivity (GP): $\mathrm{GP} = W / E_\mathrm{sol}$, with $W$ denoting the useful scientific work performed (e.g., mesh cells processed).
On AMD MI250X, simultaneous 8× reductions in runtime and energy yield an aggregate EDP efficiency gain of approximately 64×.
For NVIDIA A100, peak gains are lower (approximately 2× in both dimensions), producing roughly a 4× EDP benefit.
Scalability analysis finds that, on SETONIX, EDP remains flat or improves up to maximum tested CU counts, with the collapse kernel exhibiting excellent multi-GPU scaling. On KAROLINA, efficiency degrades at high CU counts due to PCIe congestion—this suggests hardware interconnect can be a limiting factor for GPU-accelerated LPT codes.
6. Implications for Cosmological Simulation and Green HPC
The documented efficiency gains directly address the escalating energy and cost challenges associated with large-scale cosmological N-body simulation, especially for projects such as Euclid. GPU offloading of the collapse-time kernel in PINOCCHIO provides quantifiable reductions in wall-clock time and energy footprint, supporting the tractability of expansive mock catalogue generation.
A plausible implication is that adoption by large-scale simulation consortia can substantially improve HPC resource utility and sustainability. The portable OpenMP-target model enables deployment across diverse supercomputing architectures. The combination of EDP and GP metrics furnishes a rigorous decision framework for resource managers seeking to optimize Science-per-Watt.
The PMT library’s parallel extension enables the broader scientific computing community to apply similar energy-efficiency evaluations in other HPC contexts. This suggests the emergence of a methodological standard for “green-aware” scheduling within computational cosmology and adjacent fields.
7. Research Outlook and Recommendations
The authors recommend that for future exascale cosmological mock catalogues, particularly on AMD GPU clusters, GPU-accelerated PINOCCHIO should be prioritized to realize up to 8× reductions in energy consumption and the corresponding environmental impact. Portable OpenMP-target implementations ensure robust cross-vendor scaling, though peak gains are contingent on device-specific FP64 throughput and interconnect bandwidth.
Quantitative energy-profiling, coupled with appropriate scheduling metrics, is advocated for all high-throughput cosmological simulations. In select cases, CPU-only platforms may remain preferable, dependent on real-world workload and cluster topology. The released PMT library is positioned as a tool for community-wide reproducibility in energy-efficiency studies.
In summary, GPU-Accelerated Pinocchio substantiates the harmonious coexistence of high-performance simulation and energy sustainability, establishing methodological and practical benchmarks for future “green HPC” initiatives in scientific computing (Lacopo et al., 5 Jan 2026).