GPU-Accelerated Pinocchio for Efficient Cosmological Modeling

Updated 12 January 2026
  • GPU-Accelerated Pinocchio offloads computation-intensive tasks to GPUs, significantly enhancing performance and energy efficiency in cosmological simulations.
  • The implementation achieves up to 8× faster runtimes and energy savings, with an aggregate efficiency improvement of 64× on suitable hardware platforms.
  • Integration of OpenMP-target and CUDA/OpenCL ensures cross-vendor GPU compatibility, facilitating 'green-aware' scientific computing.

GPU-Accelerated Pinocchio is an optimized version of the PINOCCHIO code—originally designed for fast generation of dark matter halo catalogues in cosmological simulations—leveraging GPU hardware to achieve dramatic improvements in throughput and energy efficiency. The approach offloads the collapse-time computation kernel, central to third-order Lagrangian Perturbation Theory (LPT), to GPUs using portable OpenMP-target or native CUDA/OpenCL implementations. Benchmarks on AMD MI250X and NVIDIA A100 clusters demonstrate up to 8× reductions in time-to-solution and energy-to-solution, culminating in an aggregate efficiency improvement of 64× on suitable hardware. Incorporating precise energy profiling via the Power Measurement Toolkit (PMT), this methodology substantiates the feasibility of "green-aware" scientific computing.

1. Algorithmic Structure and GPU Offloading

PINOCCHIO consists of three primary algorithmic stages within grid-based LPT for dark matter halo catalogue generation:

  1. Linear Density Field Construction: Formation of the density contrast δ(x) on a regular $N^3$ mesh using Fourier-space techniques.
  2. Collapse-Time Kernel: Calculation of the collapse time $t_{\mathrm{coll}}(x)$ for each mesh cell through third-order LPT equations; in this implementation, this is the sole component ported to GPU hardware. The operation is embarrassingly parallel, as each cell's computation is independent.
  3. Halo/Filament Grouping: Collapsed points are aggregated into halos and filaments using a friends-of-friends-style merger-tree procedure.

On the device, both δ and $t_{\mathrm{coll}}$ are managed as contiguous 1D arrays of double-precision elements of length $N^3$, facilitating efficient batched memory transfer. GPU kernels (CUDA style) iterate over cells:

__global__ void collapseKernel(
    const double* __restrict__ delta,   // input: density contrast per cell
    double* __restrict__ tcoll,         // output: collapse time per cell
    int N3)                             // total number of mesh cells
{
    // One thread per mesh cell; cells are fully independent.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N3) {
        // f1, f2, and t0 are schematic placeholders for the third-order
        // LPT growth terms; the elided terms (...) complete the expansion.
        double D1 = f1(delta[idx]);
        double D2 = f2(delta[idx]);
        tcoll[idx] = t0 / (D1 + D2 + ...);
    }
}
OpenMP-target regions yield analogous functionality, providing vendor portability across AMD and NVIDIA GPU platforms. Host–device transfers, for both the input δ and the output $t_{\mathrm{coll}}$, are executed once per collapse calculation, with no intermediate staging or inner-loop data exchange.
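The portable variant of the loop above can be sketched with an OpenMP target region. This is a minimal illustration, not the project's actual source: `f1`, `f2`, and `t0` are hypothetical stand-ins for the third-order LPT growth-factor terms, and the denominator is schematic. On a compiler without offload support the pragma is ignored and the loop runs on the host.

```cpp
#include <cstddef>
#include <vector>
#include <cmath>
#include <cassert>

// Hypothetical stand-ins for the LPT growth terms (illustration only).
static double f1(double d) { return d; }
static double f2(double d) { return 0.1 * d * d; }

// Collapse-time loop offloaded via OpenMP target. The map clauses mirror
// the single host-to-device (delta) and device-to-host (tcoll) transfer
// per collapse calculation described in the text.
void collapse_times(const double* delta, double* tcoll, std::size_t n3,
                    double t0) {
    #pragma omp target teams distribute parallel for \
        map(to: delta[0:n3]) map(from: tcoll[0:n3])
    for (std::size_t i = 0; i < n3; ++i) {
        double D1 = f1(delta[i]);
        double D2 = f2(delta[i]);
        tcoll[i] = t0 / (1.0 + D1 + D2);  // schematic denominator
    }
}
```

Because each cell is independent, no reduction or synchronization clause is needed; the `map` clauses are the only host–device traffic.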

2. Portable Implementation and Data Management

GPU-Accelerated Pinocchio achieves portability through a dual strategy: OpenMP-target regions for cross-vendor (AMD/NVIDIA) deployment, plus native CUDA or HIP kernels where peak per-device performance is needed. The data layout is optimized to minimize host–device bandwidth consumption, transferring only the essential input and output arrays.

There is no persistent management of particle lists within the collapse-time kernel, and the embarrassingly parallel structure allows for asynchronous execution. Overlapping computation and memory operations are attainable using CUDA/HIP streams or asynchronous OpenMP tasks; however, these mechanisms are not elaborated in the present profiling study.
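The overlap opportunity mentioned above can be sketched host-side. The following is an illustrative double-buffered pipeline using `std::async` in place of CUDA/HIP streams: while chunk i is "computed" (standing in for the GPU kernel), chunk i+1 is "staged" (standing in for an asynchronous host-to-device copy). The per-cell formula is a placeholder, not the LPT collapse expression.

```cpp
#include <future>
#include <vector>
#include <algorithm>
#include <cstddef>
#include <cmath>
#include <cassert>

// Double-buffered chunk pipeline: stage chunk i+1 while computing chunk i.
// Real code would use cudaMemcpyAsync/hipMemcpyAsync on separate streams.
std::vector<double> pipelined(const std::vector<double>& delta,
                              std::size_t chunk) {
    std::size_t n = delta.size();
    std::vector<double> tcoll(n);
    std::vector<double> staged(delta.begin(),
                               delta.begin() + std::min(chunk, n));
    for (std::size_t start = 0; start < n; start += chunk) {
        std::size_t end = std::min(start + chunk, n);
        // Prefetch the next chunk asynchronously (the "transfer").
        std::future<std::vector<double>> next;
        if (end < n) {
            next = std::async(std::launch::async, [&, end] {
                std::size_t e2 = std::min(end + chunk, n);
                return std::vector<double>(delta.begin() + end,
                                           delta.begin() + e2);
            });
        }
        // Process the staged chunk (the "kernel"); placeholder formula.
        for (std::size_t i = start; i < end; ++i)
            tcoll[i] = 1.0 / (1.0 + staged[i - start]);
        if (end < n) staged = next.get();
    }
    return tcoll;
}
```

The same two-buffer structure carries over directly to stream-based device code, since the collapse kernel has no cross-cell dependencies.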

3. Performance Characterization and Scaling Behavior

Benchmarks are reported on both SETONIX (AMD MI250X) and KAROLINA (NVIDIA A100) clusters. Two scaling regimes are analyzed:

  • Strong Scaling: Fixed total grid of $768^3$ cells, scaling compute units (CUs) from 1 to 16.
  • Weak Scaling: Constant workload per CU (fixed $N^3$ per CU), with the total grid growing from $512^3$ to $1024^3$ and CUs from 1 to 8.
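A convenient way to read strong-scaling results is as a parallel efficiency, the ratio of ideal time $T(1)/N_{\mathrm{CU}}$ to measured time. A one-line sketch (my formulation, not code from the paper):

```cpp
#include <cassert>
#include <cmath>

// Strong-scaling efficiency: 1.0 means T(n_cu) hit the ideal T(1)/n_cu.
double strong_scaling_efficiency(double t1, double tn, int n_cu) {
    return t1 / (n_cu * tn);
}
```

With the SETONIX GPU timings reported in this section (155 s on 1 CU, 10 s on 16 CUs) this evaluates to about 0.97, consistent with the near-ideal scaling claimed for the collapse kernel.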

Representative results (SETONIX, strong scaling):

| # CUs | T_CPU [s] | T_GPU [s] | Speedup T_CPU/T_GPU | E_CPU [kJ] | E_GPU [kJ] | Speedup E_CPU/E_GPU |
|------:|----------:|----------:|--------------------:|-----------:|-----------:|--------------------:|
| 1     | 1200      | 155       | 7.7×                | 360        | 45         | 8.0×                |
| 8     | 150       | 20        | 7.5×                | 360        | 45         | 8.0×                |
| 16    | 75        | 10        | 7.5×                | 360        | 45         | 8.0×                |

For KAROLINA, the analogous speedup was approximately 2× for both runtime and energy consumption. Near-ideal scaling ($T(N_{\mathrm{CU}}) \simeq T(1)/N_{\mathrm{CU}}$) is observed for the collapse kernel, subject to hardware interconnect limitations; specifically, PCIe bandwidth bottlenecks appear beyond $\sim$12 CUs on KAROLINA.

4. Power Profiling Methodology

Performance and energy profiling is anchored by the Power Measurement Toolkit (PMT), extended here to operate in fully parallel MPI settings. Energy-to-solution $E$ is defined as:

$E = \int_0^T P(t)\, dt$

where $P(t)$ denotes instantaneous power. CPU energy readings employ RAPL counters; GPU readings use NVML (NVIDIA) and ROCm SMI (AMD), achieving roughly 1% measurement accuracy. The profiling workflow comprises:

  1. PMT instance creation per MPI rank: PMT_CREATE(comm, rapl_on, devID, numGPUs)
  2. Bracketing GPU computation: PMT_CPU_START/STOP, PMT_GPU_START/STOP
  3. Distributed counter aggregation and per-node/device results reporting.
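In practice the integral above is evaluated from discrete power samples. A minimal sketch, assuming uniform sampling and the trapezoidal rule (the actual PMT internals may differ):

```cpp
#include <vector>
#include <cstddef>
#include <cmath>
#include <cassert>

// Energy-to-solution E = ∫ P(t) dt, approximated by the trapezoidal rule
// over power samples taken every dt seconds. Watts x seconds -> joules.
double energy_to_solution(const std::vector<double>& power_w, double dt) {
    if (power_w.size() < 2) return 0.0;
    double e = 0.0;
    for (std::size_t i = 1; i < power_w.size(); ++i)
        e += 0.5 * (power_w[i - 1] + power_w[i]) * dt;
    return e;
}
```

Per-rank energies computed this way can then be summed in the distributed-aggregation step (item 3 above), e.g. via an MPI reduction.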

This suggests a robust framework for reproducible, quantitative energy benchmarking in heterogeneous HPC environments.

5. Energy Efficiency Metrics and Comparative Analysis

Several metrics are employed to characterize computational sustainability:

  • Energy-to-Solution: $E = P_{\mathrm{avg}} \times T$
  • Energy–Delay Product (EDP): $\mathrm{EDP}_w = E \times T^w$, with $w \in \{1, 2, 3\}$
  • Green Productivity (GP):

$GP = \dfrac{T_0 / T_N}{E_N / E_0}$

with $\alpha = 1$.

On AMD MI250X, simultaneous 8× reductions in runtime and energy yield a $64\times$ aggregate efficiency gain:

$(T_{\mathrm{CPU}}/T_{\mathrm{GPU}}) \times (E_{\mathrm{CPU}}/E_{\mathrm{GPU}}) = 8 \times 8 = 64$

For NVIDIA A100, peak gains are lower (2× in both dimensions), producing a $4\times$ EDP benefit.
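The two headline quantities can be spelled out as small helpers. This is a sketch of the metric definitions as stated above, not code from the paper:

```cpp
#include <cmath>
#include <cassert>

// Energy-delay product EDP_w = E * T^w for weights w in {1, 2, 3}.
double edp(double energy, double time, int w) {
    return energy * std::pow(time, w);
}

// Aggregate efficiency gain: (runtime speedup) x (energy speedup).
double aggregate_gain(double t_cpu, double t_gpu,
                      double e_cpu, double e_gpu) {
    return (t_cpu / t_gpu) * (e_cpu / e_gpu);
}
```

Plugging in the nominal MI250X factors, `aggregate_gain(8, 1, 8, 1)` reproduces the quoted $64\times$, while the A100 case, `aggregate_gain(2, 1, 2, 1)`, gives the $4\times$ figure.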

Scalability analysis finds that, on SETONIX, EDP remains flat or improves up to maximum tested CU counts, with the collapse kernel exhibiting excellent multi-GPU scaling. On KAROLINA, efficiency degrades at high CU counts due to PCIe congestion—this suggests hardware interconnect can be a limiting factor for GPU-accelerated LPT codes.

6. Implications for Cosmological Simulation and Green HPC

The documented efficiency gains directly address the escalating energy and cost challenges associated with large-scale cosmological N-body simulation, especially for projects such as Euclid. GPU offloading of the collapse-time kernel in PINOCCHIO provides quantifiable reductions in wall-clock time and energy footprint, supporting the tractability of expansive mock catalogue generation.

A plausible implication is that adoption by large-scale simulation consortia can substantially improve HPC resource utility and sustainability. The portable OpenMP-target model enables deployment across diverse supercomputing architectures. The combination of EDP and GP metrics furnishes a rigorous decision framework for resource managers seeking to optimize Science-per-Watt.

The PMT library’s parallel extension enables the broader scientific computing community to apply similar energy-efficiency evaluations in other HPC contexts. This suggests the emergence of a methodological standard for “green-aware” scheduling within computational cosmology and adjacent fields.

7. Research Outlook and Recommendations

The authors recommend that future exascale cosmological mock-catalogue campaigns, particularly on AMD GPU clusters, prioritize GPU-accelerated PINOCCHIO to realize up to $64\times$ aggregate runtime-and-energy efficiency gains. Portable OpenMP-target implementations assure robust cross-vendor scaling, though peak gains are contingent on device-specific FP64 throughput and interconnect bandwidth.

Quantitative energy-profiling, coupled with appropriate scheduling metrics, is advocated for all high-throughput cosmological simulations. In select cases, CPU-only platforms may remain preferable, dependent on real-world workload and cluster topology. The released PMT library is positioned as a tool for community-wide reproducibility in energy-efficiency studies.

In summary, GPU-Accelerated Pinocchio demonstrates that high-performance simulation and energy sustainability can be achieved together, establishing methodological and practical benchmarks for future "green HPC" initiatives in scientific computing (Lacopo et al., 5 Jan 2026).
