
IRKGL16-SIMD: High-Order Symplectic Integrator

Updated 10 November 2025
  • The IRKGL16-SIMD integrator is a high-order implicit Runge–Kutta method using 8-stage Gauss–Legendre collocation for structure-preserving, high-precision integration of Hamiltonian systems.
  • It leverages SIMD vectorization by mapping collocation stages onto parallel SIMD registers, significantly accelerating the computation on modern hardware.
  • Benchmark results demonstrate that with fixed-point iteration and optimized memory layout, it outperforms explicit symplectic methods at stringent error thresholds.

The IRKGL16-SIMD integrator is a high-order, symplectic, implicit Runge–Kutta method based on an 8-stage Gauss–Legendre collocation, specifically optimized for SIMD (Single Instruction, Multiple Data) vector processors. It is engineered for high-accuracy time integration of non-stiff Hamiltonian ODE systems in double-precision arithmetic, achieving substantial speedups on modern hardware by aligning algorithmic structure and memory layout with vectorized instructions. In numerical experiments, IRKGL16-SIMD demonstrably outperforms explicit symplectic integrators for targets of very high solution precision.

1. Mathematical Structure and Formulation

The IRKGL16-SIMD integrator is defined by the collocation method with $s = 8$ Gauss–Legendre nodes, delivering order $2s = 16$ in time. Node locations $c_i$ are the shifted roots of the degree-8 Legendre polynomial. The stage-value system is governed by the Butcher tableau:

$\begin{array}{c|cccc} c_1 & a_{11} & a_{12} & \cdots & a_{18} \\ c_2 & a_{21} & a_{22} & \cdots & a_{28} \\ \vdots & \vdots & & & \vdots \\ c_8 & a_{81} & a_{82} & \cdots & a_{88} \\ \hline & b_1 & b_2 & \cdots & b_8 \end{array}$

with

$a_{ij} = \int_{0}^{c_i} \ell_j(\tau)\,d\tau, \qquad b_j = \int_{0}^{1} \ell_j(\tau)\,d\tau,$

where $\{\ell_j\}$ are the Lagrange basis polynomials at the nodes $\{c_i\}$.
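To make the coefficient construction concrete, the following Python sketch builds the tableau for the two-stage ($s = 2$, order 4) Gauss–Legendre method by integrating the Lagrange basis polynomials in closed form. The actual integrator uses $s = 8$ and the reference implementation is in Julia, so this is an illustration of the definitions rather than the production code.

```python
import math

# Two-stage Gauss-Legendre nodes: shifted roots of the degree-2 Legendre polynomial.
c = [0.5 - math.sqrt(3) / 6, 0.5 + math.sqrt(3) / 6]

def lagrange_integral(j, upper, nodes):
    """Integral from 0 to `upper` of the j-th Lagrange basis polynomial.

    For two nodes the basis is linear: l_j(t) = (t - c_other)/(c_j - c_other),
    so the antiderivative is (t^2/2 - c_other*t)/(c_j - c_other).
    """
    other = nodes[1 - j]
    return (upper**2 / 2 - other * upper) / (nodes[j] - other)

# a_ij = integral_0^{c_i} l_j(tau) dtau,  b_j = integral_0^1 l_j(tau) dtau
A = [[lagrange_integral(j, c[i], c) for j in range(2)] for i in range(2)]
b = [lagrange_integral(j, 1.0, c) for j in range(2)]
```

The same construction extends to $s = 8$ once the degree-8 Legendre roots are available; the resulting coefficients satisfy the symplecticity condition $b_i a_{ij} + b_j a_{ji} = b_i b_j$ to roundoff.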

The collocation equations take the form:

$s_i = y_n + h\sum_{j=1}^{8} a_{ij}\, f(s_j), \qquad i = 1,\ldots,8,$

$y_{n+1} = y_n + h\sum_{j=1}^{8} b_j\, f(s_j).$

Implicit solves are required for the $s_i$ at each step. The standard approach employs a simplified Newton iteration, though the presented IRKGL16-SIMD code uses a fixed-point iteration tailored to preserve symplecticity and numerical stability.

The method is symmetric and symplectic, enforcing the algebraic condition $b_i a_{ij} + b_j a_{ji} = b_i b_j$ (important for exact preservation of the symplectic structure) via the parametrization $a_{ij} = b_j\left(\tfrac{1}{2} + \mu_{ij}\right)$ with $\mu_{ij} = -\mu_{ji}$.
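The algebraic mechanism can be checked directly: for any weights $b_j$ and any skew-symmetric $\mu$, coefficients of the form $a_{ij} = b_j(\tfrac{1}{2} + \mu_{ij})$ satisfy the symplecticity condition. A small Python sketch (the weights and $\mu$ values here are arbitrary placeholders, not the actual IRKGL16 coefficients):

```python
import random

# Hypothetical weights b and a skew-symmetric mu; any such pair yields
# Runge-Kutta coefficients satisfying the symplecticity condition.
s = 4
b = [random.uniform(0.1, 1.0) for _ in range(s)]
mu = [[0.0] * s for _ in range(s)]
for i in range(s):
    for j in range(i):
        mu[i][j] = random.uniform(-1.0, 1.0)
        mu[j][i] = -mu[i][j]        # enforce mu_ij = -mu_ji exactly (negation is exact in FP)

a = [[b[j] * (0.5 + mu[i][j]) for j in range(s)] for i in range(s)]

# Largest violation of b_i a_ij + b_j a_ji - b_i b_j = 0 over all i, j
residual = max(abs(b[i] * a[i][j] + b[j] * a[j][i] - b[i] * b[j])
               for i in range(s) for j in range(s))
```

The residual stays at roundoff level regardless of the random draws, which is exactly why enforcing skew-symmetry of $\mu$ at the kernel level preserves symplecticity in floating point.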

2. SIMD Parallelization Strategy

IRKGL16-SIMD leverages strong data-level parallelism by mapping the eight collocation stages onto the vector lanes of modern CPUs:

  • With 8 stages and AVX-512 (8×64-bit), all stages for a given vector component are packed into one SIMD register.
  • The method mandates a "structure-of-arrays" (SoA) data layout. For each ODE component $k$, the vector of stage values

$(s_{1,k},\, s_{2,k},\, \ldots,\, s_{8,k})$

is held as a contiguous SIMD vector. This enables all eight stages of component $k$ to be updated in parallel.

Principal datatypes include Vec{8,Float64} for SIMD doubles, and VecArray{8,Float64,2} for the $8 \times d$ stage block (where $d$ is the system dimension). Arrays are stored in column-major order with the stage index as the fast index.
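A language-neutral way to picture the SoA layout is a flat buffer in which the eight stage values of each component occupy one contiguous, register-sized block. The Python sketch below uses a plain typed array standing in for Julia's Vec{8,Float64}, with an illustrative dimension d:

```python
from array import array

VEC_WIDTH = 8          # stages per SIMD register (AVX-512: 8 x 64-bit lanes)
d = 3                  # illustrative system dimension (hypothetical)

# Structure-of-arrays: the 8 stage values of each component are contiguous,
# so one aligned vector load fetches all stages of a component at once.
stages = array('d', [0.0] * (VEC_WIDTH * d))

def stage_slice(k):
    """The contiguous block holding stages s_{1,k}..s_{8,k} of component k."""
    return slice(k * VEC_WIDTH, (k + 1) * VEC_WIDTH)

# Stage i of component k lives at flat index k*VEC_WIDTH + i (stage index fast)
for k in range(d):
    for i in range(VEC_WIDTH):
        stages[k * VEC_WIDTH + i] = 10.0 * k + i
```

With the stage index fastest, each `stage_slice(k)` corresponds to exactly one aligned SIMD load/store in the Julia implementation.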

Parallel numerical evaluation and update stages employ SIMD-enabled elementwise arithmetic and vectorized constants, ensuring all floating-point operations exploit maximal hardware-wide parallelism.

The fixed-point update proceeds in sweeps: each sweep evaluates $f$ at all eight stage values and then refreshes every $s_i$ from the collocation equations, repeating until the stage values stabilize. All intermediate quantities, such as the stage values and the right-hand-side evaluations, are likewise held as SIMD vectors.
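A minimal, runnable rendition of this sweep in Python, scaled down to $s = 2$ with scalar arithmetic for clarity (the reference code is Julia operating on SIMD vectors), applied to a harmonic oscillator:

```python
import math

# Hardcoded 2-stage Gauss-Legendre tableau (order 4); the real method uses s = 8.
r3 = math.sqrt(3) / 6
A = [[0.25, 0.25 - r3], [0.25 + r3, 0.25]]
b = [0.5, 0.5]
s = 2

def f(y):
    """Harmonic oscillator: q' = p, p' = -q (Hamiltonian H = (q^2 + p^2)/2)."""
    q, p = y
    return (p, -q)

def step(y, h, sweeps=50):
    """One IRK step: fixed-point iteration on the collocation equations."""
    stages = [list(y) for _ in range(s)]        # initial guess: s_i = y_n
    for _ in range(sweeps):
        F = [f(si) for si in stages]            # evaluate all stages first...
        for i in range(s):                      # ...then refresh s_i = y_n + h*sum_j a_ij f(s_j)
            for k in range(2):
                stages[i][k] = y[k] + h * sum(A[i][j] * F[j][k] for j in range(s))
    F = [f(si) for si in stages]
    return tuple(y[k] + h * sum(b[j] * F[j][k] for j in range(s)) for k in range(2))

# Integrate past one period; energy should be conserved to near roundoff.
y, h = (1.0, 0.0), 0.1
for _ in range(int(2 * math.pi / h) + 1):
    y = step(y, h)
energy_drift = abs((y[0]**2 + y[1]**2) / 2 - 0.5)
```

Each sweep is a Jacobi-style update: all $f(s_j)$ are evaluated before any $s_i$ is refreshed, so the stage refreshes are mutually independent and map directly onto SIMD lanes in the vectorized code.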

3. Implementation Considerations

  • Programming Language: Julia 1.10.7, utilizing the SIMD.jl package and LLVM JIT for hardware vectorization (AVX2/AVX512/x86_64 or NEON/ARM64).
  • Memory Layout: Arrays are 64-byte aligned to maximize SIMD load/store efficiency.
  • Symplecticity: By encoding the coefficients as $a_{ij} = b_j\left(\tfrac{1}{2} + \mu_{ij}\right)$ and ensuring $\mu_{ij} = -\mu_{ji}$ at the kernel level, the symplectic structure is preserved up to IEEE-754 roundoff.
  • Iteration scheme: The deployed implementation uses a robust fixed-point iteration (rather than Newton) with an empirical energy drift stopping criterion.
  • Extensibility: A Newton-based solver could be substituted, with the stage Jacobian blocked for SIMD and reused between steps for additional efficiency (although not implemented in the reference code).

High efficiency is attained for sufficiently large system dimension $d$, due to amortization of vector reads/writes. For small $d$, efficiency gains are markedly smaller.

4. Benchmark Performance and Numerical Results

All results are for an 11th-gen Intel i7-11850H with AVX-512 (8×64-bit SIMD), running Julia with bounds checking disabled.

Schwarzschild Black Hole Geodesics

  • IRKGL16-SIMD attains the target maximum local Hamiltonian error in 0.08 s.
  • The reference explicit symplectic method (SS05, r=10) reaches the same error in 0.12 s.
  • At tighter error targets, the advantage of IRKGL16-SIMD over SS05 widens further.

Outer Solar System (6-body Problem)

  • For the target accuracy over the full integration span, IRKGL16-SIMD requires the least wall-clock time, ahead of the explicit RKN(8)-BM02 and RKN(6)-BCE22 splitting integrators.

Hénon–Heiles System

  • For the target RMS position error over the benchmark interval:
    • IRKGL16-SIMD: 0.15 s
    • Reference splitting integrators: 0.22 s and 0.30 s

Method          Order   Time [s] at target error   Speedup vs IRKGL16-SEQ
IRKGL16-SEQ     16      0.35                       —
IRKGL16-SIMD    16      0.08                       4.4×
SS05 (r=10)     10      0.12                       2.9×
S-BM02          6       0.14                       2.5×

These results highlight that IRKGL16-SIMD, although incurring higher per-step costs due to the implicit solve (8-stage fixed-point iterates), outpaces explicit integrators at stringent error budgets and permits much larger timesteps for a fixed error.

5. Practical Recommendations

  • Step size $h$: For optimal accuracy in double precision, choose $h$ to balance truncation error against accumulated roundoff.
  • System size: SIMD efficiency is best when the system dimension is large enough to keep the vector lanes filled.
  • Hardware notes:
    • AVX512 gives near-maximal speedups (using all 8 stages per SIMD register).
    • AVX2 achieves a smaller speedup (4-wide registers), with software adjustments.
    • ARM NEON functions analogously with Vec{4,Float32}.
  • Symplectic enforcement: Always parametrize via $a_{ij} = b_j\left(\tfrac{1}{2} + \mu_{ij}\right)$ and enforce $\mu_{ij} = -\mu_{ji}$ in floating point to guarantee algebraic symplecticity.
  • Trade-offs: At moderate accuracy requirements, explicit splitting schemes may retain an edge in raw speed. At high accuracy or over long integration times, IRKGL16-SIMD is preferable due to its stability and precision.
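The truncation/roundoff balance behind the step-size recommendation can be illustrated with a toy error model, $E(h) \approx C\,h^{16} + (T/h)\,\varepsilon$: truncation falls as $h^{16}$ while roundoff accumulates with the number of steps $T/h$. The constants $C$ and $T$ below are hypothetical placeholders, not fitted values:

```python
import math

EPS = 2.0**-52   # double-precision unit roundoff

def total_error(h, C=1.0, T=1.0):
    # Toy model: order-16 truncation term plus roundoff accumulated over T/h steps.
    return C * h**16 + (T / h) * EPS

# Scan step sizes; the minimum marks the truncation/roundoff balance point.
hs = [10**(-k / 10) for k in range(0, 21)]   # h from 1 down to 0.01
best_h = min(hs, key=total_error)
```

Because the order-16 truncation term falls so steeply, the optimum sits at a comparatively large $h$, which is part of why the method remains accurate with generous timesteps.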

6. Applicability, Limitations, and Context

IRKGL16-SIMD is applicable to generic Hamiltonian ODEs, notably when the system is non-stiff, high-dimensional, and high-precision results are required. Unlike splitting-based explicit methods, it does not require a partition of the Hamiltonian into solvable parts, thus broadening its domain of use.

Limitations include:

  • Inefficiency for low-dimensional systems or extremely stiff ODEs (due to slower fixed-point convergence).
  • Per-step overhead of implicit solves cannot be fully eliminated, but SIMD and fixed-point schemes mitigate this bottleneck.

A plausible implication is that further speedups may be achieved on multi-core or distributed memory architectures via stage-parallel methods and low-rank corrections, as demonstrated for Gauss–Legendre IRK in (Durastante et al., 23 May 2025). However, such approaches rely on additional matrix decompositions and parallelization strategies not present in the standard IRKGL16-SIMD implementation.

7. Summary and Outlook

IRKGL16-SIMD delivers a symplectic, high-order, implicit integration framework tailored for vectorized hardware. Its design ensures efficient, accurate, and structure-preserving integration of moderate-to-large-scale Hamiltonian ODEs in regimes where explicit methods either fail to reach required precision or suffer from excessive timestep constraints. The method is aligned with trends in computational hardware, exploiting SIMD at the algorithmic core to realize significant wall-time improvements. The precise balancing of high algebraic order, IEEE-exact symplecticity, and transparent SIMD support positions IRKGL16-SIMD as a preferred integrator for demanding, long-term ODE propagation tasks on modern CPUs (Antoñana et al., 5 Nov 2025).
