IRKGL16-SIMD: High-Order Symplectic Integrator
- The IRKGL16-SIMD integrator is a high-order implicit Runge–Kutta method using 8-stage Gauss–Legendre collocation for structure-preserving, high-precision integration of Hamiltonian systems.
- It leverages SIMD vectorization by mapping collocation stages onto parallel SIMD registers, significantly accelerating the computation on modern hardware.
- Benchmark results demonstrate that with fixed-point iteration and optimized memory layout, it outperforms explicit symplectic methods at stringent error thresholds.
The IRKGL16-SIMD integrator is a high-order, symplectic, implicit Runge–Kutta method based on 8-stage Gauss–Legendre collocation, specifically optimized for SIMD (Single Instruction, Multiple Data) vector processors. It is engineered for high-accuracy time integration of non-stiff Hamiltonian ODE systems in double-precision arithmetic, achieving substantial speedups on modern hardware by aligning algorithmic structure and memory layout with vectorized instructions. In numerical experiments, IRKGL16-SIMD outperforms explicit symplectic integrators when very high solution precision is required.
1. Mathematical Structure and Formulation
The IRKGL16-SIMD integrator is defined by the collocation method with Gauss–Legendre nodes, delivering order $2s = 16$ in time with $s = 8$ stages. The node locations $c_1, \ldots, c_8$ are the roots of the degree-8 Legendre polynomial shifted to $[0, 1]$. The stage-value system is governed by the Butcher tableau:
$\begin{array}{c|cccc} c_1 & a_{11} & a_{12} & \cdots & a_{18} \\ c_2 & a_{21} & a_{22} & \cdots & a_{28} \\ \vdots & \vdots & \vdots & & \vdots \\ c_8 & a_{81} & a_{82} & \cdots & a_{88} \\ \hline & b_1 & b_2 & \cdots & b_8 \end{array}$
with
$a_{ij} = \int_0^{c_i} \ell_j(\tau)\, d\tau, \qquad b_j = \int_0^1 \ell_j(\tau)\, d\tau,$
where $\ell_1, \ldots, \ell_8$ are the Lagrange basis polynomials at the nodes $c_1, \ldots, c_8$.
The collocation equations take the form:
$Y_i = y_n + h \sum_{j=1}^{8} a_{ij}\, f(Y_j), \quad i = 1, \ldots, 8, \qquad y_{n+1} = y_n + h \sum_{j=1}^{8} b_j\, f(Y_j).$
Implicit solves are required for the stage values $Y_1, \ldots, Y_8$ at each step. The standard approach employs a simplified Newton iteration, though the presented IRKGL16-SIMD code uses a fixed-point iteration tailored to preserve symplecticity and numerical stability.
The method is symmetric and symplectic, enforcing the algebraic condition $b_i a_{ij} + b_j a_{ji} = b_i b_j$ (important for exact preservation of the symplectic structure) by the parametrization $a_{ij} = \mu_{ij} b_j$ together with $\mu_{ij} + \mu_{ji} = 1$.
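As a numerical sanity check, the tableau and the symplecticity identity can be reproduced directly. The sketch below assumes the FastGaussQuadrature.jl package (not part of the reference implementation) and verifies $\mu_{ij} + \mu_{ji} = 1$ to roundoff:

```julia
# Build the 8-stage Gauss–Legendre tableau and check symplecticity numerically.
using FastGaussQuadrature

s = 8
x, w = gausslegendre(s)            # nodes/weights on [-1, 1]
c = (x .+ 1) ./ 2                  # shifted collocation nodes on [0, 1]
b = w ./ 2                         # quadrature weights on [0, 1]

# Lagrange basis polynomial ℓ_j evaluated at τ
ℓ(j, τ) = prod((τ - c[k]) / (c[j] - c[k]) for k in 1:s if k != j)

# a_ij = ∫_0^{c_i} ℓ_j(τ) dτ, via the same Gauss rule mapped to [0, c_i]
# (ℓ_j has degree 7, so the 8-point rule integrates it exactly)
A = [c[i]/2 * sum(w[q] * ℓ(j, c[i]/2 * (x[q] + 1)) for q in 1:s)
     for i in 1:s, j in 1:s]

μ = A ./ b'                        # μ_ij = a_ij / b_j
@show maximum(abs, μ + μ' .- 1)    # ≈ 1e-16: symplecticity holds to roundoff
```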
2. SIMD Parallelization Strategy
IRKGL16-SIMD leverages strong data-level parallelism by mapping the eight collocation stages onto the vector lanes of modern CPUs:
- With 8 stages and AVX-512 (8×64-bit), all stages for a given vector component are packed into one SIMD register.
- The method mandates a "structure-of-arrays" (SoA) data layout. For each ODE component $j$, the vector $(Y_{1,j}, Y_{2,j}, \ldots, Y_{8,j})$ of its eight stage values
is held as a contiguous SIMD vector. This enables all eight stages of component $j$ to be updated in parallel.
Principal datatypes include `Vec{8,Float64}` for SIMD doubles and `VecArray{8,Float64,2}` for the $8 \times D$ stage block. Arrays are stored in column-major order with the stage index as the fast index.
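The sketch below illustrates this packing with SIMD.jl; the array `raw` and the explicit tuple construction are illustrative only, not the reference code's `VecArray` internals:

```julia
# Structure-of-arrays packing: the 8 stage values of each ODE component
# occupy one SIMD register.
using SIMD

D = 6                                # number of ODE components (example value)
raw = rand(8, D)                     # column-major stage block, stage index fastest
Yn = [Vec{8,Float64}(ntuple(i -> raw[i, j], 8)) for j in 1:D]
Yn[1] + 2.0 * Yn[1]                  # elementwise: one instruction updates all 8 stages
```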
Parallel numerical evaluation and update stages employ SIMD-enabled elementwise arithmetic and vectorized constants, ensuring all floating-point operations exploit maximal hardware-wide parallelism.
A representative fixed-point update pseudocode (Julia-like) is:
```julia
for j in 1:D
    Fnj = Fn[j]                 # Vec{8}: f evaluated at all 8 stages, component j
    Lnj = h * (b * Fnj)         # Vec{8}: L_i = h * b_i * f_i (elementwise product)
    dYnj = μ[1] * Lnj[1]        # μ[i] is the i-th column of μ packed as a Vec{8}
    for i in 2:8
        dYnj += μ[i] * Lnj[i]   # accumulate the matrix-vector product μ * L
    end
    Yn[j] = y_n[j] + dYnj       # broadcast scalar y_n[j] across all 8 lanes
end
```
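Accumulating `dYnj` from scalar-broadcast columns `μ[i]*Lnj[i]` realizes the matrix–vector product $\mu L$ using only vertical (lane-wise) SIMD operations, avoiding horizontal cross-lane reductions, which are comparatively expensive on current hardware.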
3. Implementation Considerations
- Programming Language: Julia 1.10.7, utilizing the SIMD.jl package and LLVM JIT for hardware vectorization (AVX2/AVX512/x86_64 or NEON/ARM64).
- Memory Layout: Arrays are 64-byte aligned to maximize SIMD load/store efficiency.
- Symplecticity: By encoding the coefficients as $a_{ij} = \mu_{ij} b_j$ and ensuring $\mu_{ij} + \mu_{ji} = 1$ at the kernel level, the symplectic structure is preserved up to IEEE-754 roundoff.
- Iteration scheme: The deployed implementation uses a robust fixed-point iteration (rather than Newton) with an empirical energy-drift stopping criterion; a simplified sketch follows this list.
- Extensibility: A Newton-based solver could be substituted, with the Jacobian blocked for SIMD and reused between steps for additional efficiency (although not implemented in the reference code).
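A minimal sketch of such a fixed-point loop is shown below. The function name `fixed_point_stages!`, the callback `f!`, and the sup-norm stopping test are simplifying assumptions; the reference code's energy-drift criterion is more elaborate:

```julia
using SIMD

# One step's stage solve: iterate Y ← y_n + μ * (h b f(Y)) to a fixed point.
function fixed_point_stages!(Yn, y_n, Fn, f!, μ, b, h; maxiter = 50, tol = 1e-15)
    D = length(Yn)
    for _ in 1:maxiter
        f!(Fn, Yn)                     # fill Fn[j]::Vec{8} for every component j
        Δ = 0.0
        for j in 1:D
            Lnj = h * (b * Fn[j])      # Vec{8}: h * b_i * f_i, elementwise
            dYnj = μ[1] * Lnj[1]       # μ[i]::Vec{8} is the i-th column of μ
            for i in 2:8
                dYnj += μ[i] * Lnj[i]
            end
            Ynew = y_n[j] + dYnj
            Δ = max(Δ, maximum(abs(Ynew - Yn[j])))  # sup-norm update size
            Yn[j] = Ynew
        end
        Δ < tol && break               # simplified stand-in for the energy-drift test
    end
    return Yn
end
```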
High efficiency is attained when the system dimension $D$ is sufficiently large, due to amortization of vector reads/writes. For small $D$, efficiency gains are markedly smaller.
4. Benchmark Performance and Numerical Results
All results are for an 11th-gen Intel i7-11850H with AVX-512 (8×64-bit SIMD), running Julia with bounds checks disabled.
Schwarzschild Black Hole Geodesics
- IRKGL16-SIMD attains the target maximum local Hamiltonian error in 0.08 s.
- The reference explicit symplectic method, SS05 ($r = 10$), reaches the same error in 0.12 s.
- At more stringent error targets, the speed advantage of IRKGL16-SIMD over SS05 widens.
Outer Solar System (6-body Problem)
- To achieve the target error over the full integration span, IRKGL16-SIMD required the least wall time, ahead of the explicit RKN(8) S-BM02 and RKN(6) S-BCE22 methods.
Hénon–Heiles System
- For a fixed RMS position error over the integration interval:
- IRKGL16-SIMD: 0.15 s
- Reference splitting schemes: 0.22 s and 0.30 s
| Method | Order | Time [s] at target error | Speedup vs IRKGL16-SEQ |
|---|---|---|---|
| IRKGL16-SEQ | 16 | 0.35 | 1× |
| IRKGL16-SIMD | 16 | 0.08 | 4.4× |
| SS05 (r=10) | 10 | 0.12 | 2.9× |
| S-BM02 | 6 | 0.14 | 2.5× |
These results highlight that IRKGL16-SIMD, although incurring higher per-step costs due to the implicit solve (8-stage fixed-point iterates), outpaces explicit integrators for stringent error budgets and permits much larger timesteps at a fixed error.
5. Practical Recommendations
- Step size $h$: For optimal accuracy, choose $h$ so that the truncation error is pushed down to the level of double-precision roundoff, balancing truncation against roundoff error.
- System size: SIMD efficiency is best for moderately large numbers of ODE components $D$.
- Hardware notes:
- AVX512 gives near-maximal speedups (using all 8 stages per SIMD register).
- AVX2 (4-wide doubles) achieves a smaller but still substantial speedup, with software adjustments to split the eight stages across two registers.
- ARM NEON functions analogously with Vec{4,Float32}.
- Symplectic enforcement: Always parametrize via $a_{ij} = \mu_{ij} b_j$ and enforce $\mu_{ij} + \mu_{ji} = 1$ in floating point to guarantee algebraic symplecticity.
- Trade-offs: For moderate accuracy requirements, explicit splitting schemes may retain an edge in raw speed. At high accuracy or long integration times, IRKGL16-SIMD is preferable due to its stability and precision.
6. Applicability, Limitations, and Context
IRKGL16-SIMD is applicable to generic Hamiltonian ODEs, notably when the system is non-stiff, high-dimensional, and high-precision results are required. Unlike splitting-based explicit methods, it does not require a partition of the Hamiltonian into solvable parts, thus broadening its domain of use.
Limitations include:
- Inefficiency for low-dimensional systems or extremely stiff ODEs (due to slower fixed-point convergence).
- Per-step overhead of implicit solves cannot be fully eliminated, but SIMD and fixed-point schemes mitigate this bottleneck.
A plausible implication is that further speedups may be achieved on multi-core or distributed memory architectures via stage-parallel methods and low-rank corrections, as demonstrated for Gauss–Legendre IRK in (Durastante et al., 23 May 2025). However, such approaches rely on additional matrix decompositions and parallelization strategies not present in the standard IRKGL16-SIMD implementation.
7. Summary and Outlook
IRKGL16-SIMD delivers a symplectic, high-order, implicit integration framework tailored for vectorized hardware. Its design ensures efficient, accurate, and structure-preserving integration of moderate-to-large-scale Hamiltonian ODEs in regimes where explicit methods either fail to reach required precision or suffer from excessive timestep constraints. The method is aligned with trends in computational hardware, exploiting SIMD at the algorithmic core to realize significant wall-time improvements. The precise balancing of high-algebraic order, IEEE-exact symplecticity, and transparent SIMD support positions IRKGL16-SIMD as the preferred integrator for demanding, long-term ODE propagation tasks on modern CPUs (Antoñana et al., 5 Nov 2025).