SPACE Centre of Excellence (CoE) Overview
- SPACE Centre of Excellence is a pan-European, multi-institutional initiative that re-engineers astrophysical and cosmological simulation codes for exascale high-performance computing.
- It integrates advanced computational paradigms, machine learning, and high-performance data analytics to drive breakthroughs in astrophysics, cosmology, and space plasma physics.
- The initiative optimizes flagship codes and hardware performance on platforms like Leonardo by employing techniques such as kernel offloading, asynchronous operations, and dataflow scheduling.
The SPACE Centre of Excellence (CoE) is a pan-European, multi-institutional initiative dedicated to modernizing, scaling, and integrating astrophysical and cosmological simulation and analysis codes for exascale high-performance computing systems. Its mandate encompasses code re-engineering, innovative computational paradigms, high-performance data analysis, machine learning integration, and community-driven software infrastructure to advance large-scale scientific discovery in astrophysics, cosmology, and space plasma physics (Shukla et al., 28 Oct 2025, Shukla et al., 21 Dec 2025).
1. Mission, Structure, and Participating Entities
The SPACE CoE’s primary objective is to position Europe's flagship astrophysics and cosmology simulation and analysis codes for exascale computing environments, characterized by operation rates of order $10^{18}$ ops/s and heterogeneous hardware (GPUs, many-core CPUs, hierarchical memory). The CoE was launched under EuroHPC and Horizon Europe (Grant No. 101093441), uniting 17 institutions across eight countries (Shukla et al., 21 Dec 2025):
- Principal partners: CINECA (Italy, overall coordination, resource management, code optimization), IT4Innovations (Czech Republic, performance profiling, POP3 CoE metrics), University of Turin, INAF–Trieste, LMU Munich, KU Leuven, and E4 Computer Engineering, with further contributions from software developers, plasma-physics groups, and hardware partners (ENGINSOFT, CRAL‐CNRS‐ENS Lyon, SiPearl via EPI, EUPEX, UNIGE).
- Governance: A steering committee led by CINECA; technical work packages address code modernization, hardware/software co-design, data/ML, and training.
- Funding: EuroHPC Joint Undertaking, with additional national contributions from Belgium, Czech Republic, France, Germany, Greece, Italy, Norway, and Spain.
The collaboration integrates code authors, astrophysicists, high-performance computing (HPC) experts, hardware manufacturers, visualization/ML tool developers, and performance engineers, enabling sustainable, portable, and scalable workflows (Shukla et al., 28 Oct 2025, Shukla et al., 21 Dec 2025).
2. Scientific Drivers and Targeted Simulation Codes
SPACE’s scientific agenda targets multi-physics models resolving hydrodynamics, magnetohydrodynamics (MHD), radiative transfer, and kinetic plasma phenomena across cosmic and stellar scales, producing data at petabyte scales. These simulations require thousands to millions of GPUs or many-core CPUs (Shukla et al., 21 Dec 2025, Shukla et al., 28 Oct 2025).
The CoE focuses on the following seven astrophysical and cosmological flagships:
| Code | Domain/Physics | Programming Model |
|---|---|---|
| Pluto (gPLUTO) | (Relativistic) HD, (Relativistic) MHD | MPI+OpenACC GPU offload |
| OpenGadget3 | N-body, SPH, radiative transfer | MPI+OpenMP+OpenACC |
| Ramses | AMR hydro + gravity | MPI+OpenMP hybrid |
| PIC3D (iPIC3D) | Implicit Particle-in-Cell plasma | MPI+CUDA/HIP |
| ChaNGa | N-body, SPMHD | Charm++, GPU |
| BHAC | GRMHD, AMR | MPI+OpenACC |
| FIL/GRACE | GRMHD on evolving spacetimes | MPI+OpenMP, Kokkos |
The codes are systematically re-engineered to exploit hardware accelerators where available (NVIDIA A100, AMD MI250, ARM-based Rhea with HBM), with new execution paradigms favoring task-based/dataflow models for heterogeneous architectures (Shukla et al., 21 Dec 2025, Shukla et al., 28 Oct 2025).
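As an illustration of the MPI+OpenACC offload style used by several of these codes (e.g. gPLUTO, BHAC), the following is a minimal, hypothetical sketch of moving a finite-volume-style update loop onto a GPU. The array names, sizes, and update formula are invented placeholders and are not taken from any SPACE code.

```cpp
// Minimal sketch: GPU offload of a 1D finite-volume-style update with OpenACC.
// All names (u, flux, NX, dtdx) are illustrative placeholders, not SPACE code.
#include <vector>
#include <cstdio>

int main() {
    const int NX = 1 << 20;          // number of cells (illustrative)
    const double dtdx = 0.1;         // dt/dx (illustrative)
    std::vector<double> u(NX, 1.0), flux(NX + 1, 0.0);

    double* up = u.data();
    double* fp = flux.data();

    // Keep the arrays resident on the device across the whole update,
    // avoiding per-kernel host<->device transfers.
    #pragma acc data copy(up[0:NX]) create(fp[0:NX+1])
    {
        // Flux computation offloaded to the accelerator.
        #pragma acc parallel loop present(up, fp)
        for (int i = 1; i < NX; ++i)
            fp[i] = 0.5 * (up[i - 1] + up[i]);   // placeholder flux average

        // Conservative update, also on the device.
        #pragma acc parallel loop present(up, fp)
        for (int i = 1; i < NX - 1; ++i)
            up[i] -= dtdx * (fp[i + 1] - fp[i]);
    }

    std::printf("u[NX/2] = %f\n", u[NX / 2]);
    return 0;
}
```

Compiled with an OpenACC-capable compiler (e.g. `nvc++ -acc`), both loops execute on the GPU while the enclosing data region keeps the arrays in device memory between kernels.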
3. Platforms, Software Environments, and Programming Paradigms
SPACE-CoE benchmarks and optimizes performance on EuroHPC flagship platforms, notably Leonardo (CINECA), MareNostrum 5, MeluXina, and LUMI. Leonardo is illustrative, with a "Booster" partition of 3,456 nodes, each having 4× NVIDIA A100 (40 GB HBM2e) and a Dragonfly+ HDR network delivering pre-exascale capability (≈240 PFlop/s, mixed precision) (Shukla et al., 28 Oct 2025). Software stacks leverage Spack-managed modules, GNU/Intel/NVIDIA compilers, OpenMPI 4.1.x, CUDA/HIP, and emerging performance-portable frameworks.
Key programming paradigms and middleware include (Shukla et al., 21 Dec 2025):
- MPI+X: X = OpenMP, OpenACC, CUDA, HIP, or Kokkos, for mapping kernels onto accelerators and many-core CPUs.
- Task/dataflow scheduling: E.g., Charm++ (ChaNGa), dynamic OpenMP domains (Ramses) to enhance communication-computation overlap and cache locality.
- Asynchronous/non-blocking operations: Device–host transfers, MPI_Irecv/MPI_Isend, CUDA streams (see the sketch after this list).
- Performance-portability and workflow orchestration: Hecuba (in situ data staging), ParaView/Catalyst (visualization), StreamFlow.
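The following is a minimal sketch, not taken from any SPACE code, of the asynchronous pattern referred to above: ghost-cell exchanges are posted with MPI_Isend/MPI_Irecv, interior work proceeds while messages are in flight, and only the boundary update waits on completion. In the GPU codes this pattern is combined with CUDA streams or OpenACC async queues so that device–host transfers overlap as well; all array names and sizes below are placeholders.

```cpp
// Minimal sketch of communication/computation overlap with non-blocking MPI.
// A 1D domain with one ghost cell per side; names and sizes are illustrative.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024;                      // local cells (illustrative)
    std::vector<double> u(N + 2, rank);      // u[0] and u[N+1] are ghost cells
    std::vector<double> unew(N + 2, 0.0);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    MPI_Request reqs[4];
    // 1. Post ghost-cell exchanges first (non-blocking).
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    // 2. Update interior cells that do not depend on ghost data,
    //    overlapping the transfers above (placeholder 3-point stencil).
    for (int i = 2; i <= N - 1; ++i)
        unew[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);

    // 3. Complete the exchange, then update the boundary-adjacent cells.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    unew[1] = 0.5 * u[1] + 0.25 * (u[0] + u[2]);
    unew[N] = 0.5 * u[N] + 0.25 * (u[N - 1] + u[N + 1]);

    MPI_Finalize();
    return 0;
}
```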
Performance modeling applies classical scalings of the form

$$T(N) \simeq \frac{T_1}{N} + \alpha + \beta\, m,$$

where $T(N)$ is the wall-clock time on $N$ ranks, $T_1$ the single-rank compute time, $\alpha$ the network latency, $\beta$ the inverse bandwidth, and $m$ the message size (Shukla et al., 21 Dec 2025, Shukla et al., 28 Oct 2025).
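As a worked illustration of how such a model anticipates scaling behaviour, the sketch below evaluates the latency–bandwidth form above for a range of rank counts. The parameter values are arbitrary placeholders, not measurements from SPACE platforms.

```cpp
// Minimal sketch: evaluating the scaling model T(N) = T1/N + alpha + beta*m.
// All parameter values are arbitrary placeholders, not measured figures.
#include <cstdio>

int main() {
    const double T1    = 3000.0;   // single-rank compute time [s] (placeholder)
    const double alpha = 2.0e-6;   // per-message network latency [s] (placeholder)
    const double beta  = 1.0e-10;  // inverse bandwidth [s/byte] (placeholder)
    const double m     = 8.0e6;    // halo message size [bytes] (placeholder)

    for (int N = 1; N <= 1024; N *= 2) {
        double t   = T1 / N + alpha + beta * m;   // modeled wall-clock time
        double eff = (T1 / N) / t;                // parallel efficiency estimate
        std::printf("N = %5d  T = %9.3f s  efficiency = %.3f\n", N, t, eff);
    }
    return 0;
}
```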
4. Porting, Optimization, and Scalability Achievements
SPACE employs rigorous profiling (Extrae/Paraver, POP3 CoE metrics for CPUs; NVIDIA Nsight Systems/Compute for GPUs) to identify bottlenecks and guide optimization (Shukla et al., 28 Oct 2025). Kernel offloading, communication-computation overlap, and optimized memory layouts underpin performance gains.
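To make GPU timelines interpretable in Nsight Systems, simulation phases are typically annotated with named ranges. The following is a minimal, hypothetical sketch using the standard NVTX C API; the phase names and the placeholder work inside them are invented for illustration and are not SPACE instrumentation.

```cpp
// Minimal sketch: NVTX ranges make simulation phases show up as named spans
// on NVIDIA Nsight Systems timelines. Phase names and work are placeholders.
#include <nvToolsExt.h>   // CUDA toolkit header (nvtx3/nvToolsExt.h in recent versions)
#include <vector>

// Placeholder "kernels" standing in for real device work.
static void compute_fluxes(std::vector<double>& u) {
    for (double& x : u) x *= 0.999;
}
static void exchange_ghosts(std::vector<double>& u) {
    u.front() = u.back();   // stand-in for an MPI halo exchange
}

int main() {
    std::vector<double> u(1 << 20, 1.0);
    for (int step = 0; step < 10; ++step) {
        nvtxRangePushA("flux_computation");   // open a named range
        compute_fluxes(u);
        nvtxRangePop();                       // close it

        nvtxRangePushA("ghost_exchange");
        exchange_ghosts(u);
        nvtxRangePop();
    }
    return 0;
}
```

Linking against the NVTX library (e.g. `-lnvToolsExt`) is enough for the ranges to appear when the run is captured with Nsight Systems.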
Representative results:
- gPLUTO: Single-node GPU (A100) achieves 9.6× acceleration over CPU for an Orszag-Tang vortex test (312 s vs 2,982 s); weak scaling shows ≈88% efficiency up to 512 GPUs; strong scaling E(N) ≈ 0.89–0.95 (depending on test) (Shukla et al., 28 Oct 2025, Shukla et al., 21 Dec 2025).
- OpenGadget3: Gravity-only and hydrodynamical runs at large particle counts sustain parallel efficiency >80% under resource increases of up to 16×, depending on the configuration. A neighbor-bunching tree-walk kernel provides up to 10× speedup for the gravity calculation (Shukla et al., 21 Dec 2025, Shukla et al., 28 Oct 2025).
- iPIC3D: The particle mover and moment gatherer achieve 40× and 100× GPU speedups, respectively, with the full code running 40× faster than on CPU (Shukla et al., 28 Oct 2025); weak scaling reaches E(N) ≈ 78% at 1,024 GPUs.
- Other codes: Ramses achieves >80% efficiency at 1,024 ranks with hybrid MPI+OpenMP; ChaNGa exceeds 85% efficiency on 65,536 cores using overdecomposition; BHAC reaches ~95% GPU scaling, net 20× vs CPU (Shukla et al., 21 Dec 2025).
Optimization techniques include fine-grained load balancing, kernel fusion, stencil-aware loop tiling, vectorization (SVE/NEON, AVX-512), asynchronous I/O, and data-layout tuning for high-bandwidth memory (Shukla et al., 21 Dec 2025, Shukla et al., 28 Oct 2025).
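The sketch below illustrates one of these techniques, cache blocking (tiling) of a 2D 5-point stencil; the tile size, array names, and coefficients are placeholders, and the real codes apply analogous transformations inside their own kernels.

```cpp
// Minimal sketch of stencil-aware loop tiling (cache blocking) for a 2D
// 5-point stencil. Sizes, tile shape, and coefficients are placeholders.
#include <vector>
#include <algorithm>

void smooth_tiled(const std::vector<double>& in, std::vector<double>& out,
                  int nx, int ny, int tile) {
    for (int jj = 1; jj < ny - 1; jj += tile) {
        for (int ii = 1; ii < nx - 1; ii += tile) {
            const int jmax = std::min(jj + tile, ny - 1);
            const int imax = std::min(ii + tile, nx - 1);
            // Work tile-by-tile so the working set fits in cache while the
            // inner loop streams contiguously along the fast (i) index.
            for (int j = jj; j < jmax; ++j)
                for (int i = ii; i < imax; ++i)
                    out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                              in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
        }
    }
}

int main() {
    const int nx = 2048, ny = 2048;
    std::vector<double> a(nx * ny, 1.0), b(nx * ny, 0.0);
    smooth_tiled(a, b, nx, ny, /*tile=*/64);
    return b[nx + 1] > 0 ? 0 : 1;   // use the result so the loop is not optimized away
}
```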
5. Data Analysis, Machine Learning, and Visualization Integration
Handling the massive data volumes of these simulations (particle and cell counts large enough to produce petabyte-scale outputs) necessitates advanced analysis and visualization pipelines (Shukla et al., 21 Dec 2025):
- In situ/in transit visualization: Hecuba enables co-running simulation and analysis, reducing I/O by >90%.
- Batch/offline analysis: VisIVO (via StreamFlow) provides scalable visualization; Blender pipelines support cinematic volume rendering.
- Machine learning: Post-hoc representation learning via Spherinator/HiPSter enables clustering and anomaly detection. Physics surrogates include neural models trained on high-fidelity runs for rapid property inference, as well as a radiative-transfer emulator embedded into OpenGadget3 that replaces explicit ray tracing with <1% error and ∼10× speedup (illustrated in the sketch after this list).
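As an illustration only of how a trained emulator can be dropped in where an expensive solver used to be called, the sketch below wires a tiny fully connected network into the place of a per-particle radiative-transfer evaluation. The architecture, weights, and input quantities are invented placeholders and bear no relation to the actual OpenGadget3 emulator.

```cpp
// Minimal sketch: replacing an expensive per-particle radiative-transfer call
// with a small pre-trained neural emulator. Architecture, weights, and inputs
// are invented placeholders, not the actual OpenGadget3 emulator.
#include <array>
#include <cmath>
#include <cstdio>

// One hidden layer, 3 inputs (e.g. density, temperature, column depth) -> 1 output.
struct TinyEmulator {
    std::array<std::array<double, 3>, 8> W1;  // hidden weights (loaded from training)
    std::array<double, 8> b1;
    std::array<double, 8> W2;                 // output weights
    double b2;

    double operator()(double logrho, double logT, double logNH) const {
        const std::array<double, 3> x{logrho, logT, logNH};
        double y = b2;
        for (int h = 0; h < 8; ++h) {
            double a = b1[h];
            for (int i = 0; i < 3; ++i) a += W1[h][i] * x[i];
            y += W2[h] * std::tanh(a);        // tanh hidden activation
        }
        return y;                              // e.g. a log heating/cooling rate
    }
};

int main() {
    TinyEmulator emu{};   // zero weights here; real values would come from training
    // Where the simulation previously ray-traced, it now evaluates the emulator.
    double rate = emu(-24.0, 4.0, 20.0);
    std::printf("emulated rate (placeholder weights): %g\n", rate);
    return 0;
}
```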
A standardized I/O ecosystem leverages HDF5 with uniform metadata (IVOA, FAIR), permitting interoperability and automated data discovery (Shukla et al., 21 Dec 2025).
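A minimal sketch of the kind of pattern this ecosystem relies on: a dataset written with the HDF5 C API and tagged with a descriptive attribute so downstream tools can interpret it. The file, group, dataset, and attribute names are placeholders and do not reproduce the SPACE metadata schema itself.

```cpp
// Minimal sketch: writing a dataset plus a descriptive metadata attribute with
// the HDF5 C API. File/group/dataset/attribute names are placeholders.
#include <hdf5.h>
#include <vector>
#include <cstring>

int main() {
    std::vector<double> density(1000, 1.0);               // placeholder data

    hid_t file  = H5Fcreate("snapshot_000.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/PartType0", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = {density.size()};
    hid_t space = H5Screate_simple(1, dims, nullptr);
    hid_t dset  = H5Dcreate2(group, "Density", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, density.data());

    // Attach a string attribute describing the physical unit of the dataset.
    const char* unit = "g/cm^3";
    hid_t atype  = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, std::strlen(unit) + 1);
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t attr   = H5Acreate2(dset, "Unit", atype, aspace, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, atype, unit);

    H5Aclose(attr); H5Sclose(aspace); H5Tclose(atype);
    H5Dclose(dset); H5Sclose(space); H5Gclose(group); H5Fclose(file);
    return 0;
}
```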
6. Sustainability, Software Ecosystem, and Community Outreach
Sustainability is achieved via version-controlled GitLab/GitHub repositories, CI/CD testing of MPI/OpenMP/GPU builds, containerization (Singularity, Apptainer, Spack recipes), and a common metadata schema (Shukla et al., 21 Dec 2025, Shukla et al., 28 Oct 2025). Regression tests and ReFrame-based performance tracking ensure code stability across hardware generations.
Community engagement is central, involving regular webinars, summer schools, hackathons, and detailed documentation. Deliverables (profiling, energy-efficiency reports) are published biannually. The development roadmap targets full GPU/HBM porting to Rhea-class systems, ML-infused simulation workflows, federated notebook platforms for analysis, and training on emerging parallel paradigms (SYCL, OpenMP 6.0, MPI+SHMEM) (Shukla et al., 21 Dec 2025).
7. Challenges, Bottlenecks, and Future Prospects
Major identified bottlenecks include memory-bound tree traversals in OpenGadget3, CPU-bound GMRES solvers in iPIC3D, and ghost-cell exchange overhead in gPLUTO. Proposed solutions involve coalesced memory access, asynchronous collectives, GPU-friendly Krylov solvers, and fully asynchronous CPU–GPU pipelines. Extending the software stack, tracking performance regressions, and scaling campaigns to O(10k) GPUs on upcoming exascale EuroHPC systems are near-term priorities (Shukla et al., 28 Oct 2025).
Future plans encompass the porting of missing features (AMR, radiative cooling), deeper ML integration, federated data analysis, and coordinated training/outreach, ensuring that European astrophysics remains at the computational frontier (Shukla et al., 21 Dec 2025, Shukla et al., 28 Oct 2025).