
IRKGL16-SIMD: High-Order Symplectic Integrator

Updated 10 November 2025
  • The IRKGL16-SIMD integrator is a high-order implicit Runge–Kutta method using 8-stage Gauss–Legendre collocation for structure-preserving, high-precision integration of Hamiltonian systems.
  • It leverages SIMD vectorization by mapping collocation stages onto parallel SIMD registers, significantly accelerating the computation on modern hardware.
  • Benchmark results demonstrate that with fixed-point iteration and optimized memory layout, it outperforms explicit symplectic methods at stringent error thresholds.

The IRKGL16-SIMD integrator is a high-order, symplectic, implicit Runge–Kutta method based on an 8-stage Gauss–Legendre collocation, specifically optimized for SIMD (Single Instruction, Multiple Data) vector processors. It is engineered for high-accuracy time integration of non-stiff Hamiltonian ODE systems in double-precision arithmetic, achieving substantial speedups on modern hardware by aligning algorithmic structure and memory layout with vectorized instructions. In numerical experiments, IRKGL16-SIMD demonstrably outperforms explicit symplectic integrators for targets of very high solution precision.

1. Mathematical Structure and Formulation

The IRKGL16-SIMD integrator is defined by the collocation method with $s = 8$ Gauss–Legendre nodes, delivering order $2s = 16$ in time. Node locations $c_i$ are the shifted roots of the degree-8 Legendre polynomial. The stage-value system is governed by the Butcher tableau:

$\begin{array}{c|cccc} c_1 & a_{11} & a_{12} & \cdots & a_{18} \\ c_2 & a_{21} & a_{22} & \cdots & a_{28} \\ \vdots & \vdots & & & \vdots \\ c_8 & a_{81} & a_{82} & \cdots & a_{88} \\ \hline & b_1 & b_2 & \cdots & b_8 \end{array}$

with

$a_{ij} = \int_{0}^{c_i} \ell_j(\tau)\,d\tau, \qquad b_j = \int_{0}^{1} \ell_j(\tau)\,d\tau,$

where $\{\ell_j\}$ are the Lagrange basis polynomials at the nodes $\{c_i\}$.
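To make the coefficient construction concrete, the following Python sketch builds the tableau for the two-stage ($s = 2$, order 4) Gauss–Legendre method by integrating the Lagrange basis polynomials in closed form. The actual integrator uses $s = 8$ and the reference implementation is in Julia, so this is an illustration of the definitions rather than the production code.

```python
import math

# Two-stage Gauss-Legendre nodes: shifted roots of the degree-2 Legendre polynomial.
c = [0.5 - math.sqrt(3) / 6, 0.5 + math.sqrt(3) / 6]

def lagrange_integral(j, upper, nodes):
    """Integral from 0 to `upper` of the j-th Lagrange basis polynomial.

    For two nodes the basis is linear: l_j(t) = (t - c_other)/(c_j - c_other),
    so the antiderivative is (t^2/2 - c_other*t)/(c_j - c_other).
    """
    other = nodes[1 - j]
    return (upper**2 / 2 - other * upper) / (nodes[j] - other)

# a_ij = integral_0^{c_i} l_j(tau) dtau,  b_j = integral_0^1 l_j(tau) dtau
A = [[lagrange_integral(j, c[i], c) for j in range(2)] for i in range(2)]
b = [lagrange_integral(j, 1.0, c) for j in range(2)]
```

The same construction extends to $s = 8$ once the degree-8 Legendre roots are available; the resulting coefficients satisfy the symplecticity condition $b_i a_{ij} + b_j a_{ji} = b_i b_j$ to roundoff.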

The collocation equations take the form:

$s_i = y_n + h\sum_{j=1}^{8} a_{ij}\, f(s_j), \qquad i = 1,\ldots,8,$

$y_{n+1} = y_n + h\sum_{j=1}^{8} b_j\, f(s_j).$

Implicit solves are required for the $s_i$ at each step. The standard approach employs a simplified Newton iteration, though the presented IRKGL16-SIMD code uses a fixed-point iteration tailored to preserve symplecticity and numerical stability.

The method is symmetric and symplectic, enforcing the algebraic condition $b_i a_{ij} + b_j a_{ji} = b_i b_j$ (important for exact preservation of the symplectic structure) via the parametrization $a_{ij} = b_j\left(\tfrac{1}{2} + \mu_{ij}\right)$ with $\mu_{ij} = -\mu_{ji}$.
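The algebraic mechanism can be checked directly: for any weights $b_j$ and any skew-symmetric $\mu$, coefficients of the form $a_{ij} = b_j(\tfrac{1}{2} + \mu_{ij})$ satisfy the symplecticity condition. A small Python sketch (the weights and $\mu$ values here are arbitrary placeholders, not the actual IRKGL16 coefficients):

```python
import random

# Hypothetical weights b and a skew-symmetric mu; any such pair yields
# Runge-Kutta coefficients satisfying the symplecticity condition.
s = 4
b = [random.uniform(0.1, 1.0) for _ in range(s)]
mu = [[0.0] * s for _ in range(s)]
for i in range(s):
    for j in range(i):
        mu[i][j] = random.uniform(-1.0, 1.0)
        mu[j][i] = -mu[i][j]        # enforce mu_ij = -mu_ji exactly (negation is exact in FP)

a = [[b[j] * (0.5 + mu[i][j]) for j in range(s)] for i in range(s)]

# Largest violation of b_i a_ij + b_j a_ji - b_i b_j = 0 over all i, j
residual = max(abs(b[i] * a[i][j] + b[j] * a[j][i] - b[i] * b[j])
               for i in range(s) for j in range(s))
```

The residual stays at roundoff level regardless of the random draws, which is exactly why enforcing skew-symmetry of $\mu$ at the kernel level preserves symplecticity in floating point.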

2. SIMD Parallelization Strategy

IRKGL16-SIMD leverages strong data-level parallelism by mapping the eight collocation stages onto the vector lanes of modern CPUs:

  • With 8 stages and AVX-512 (8×64-bit), all stages for a given vector component are packed into one SIMD register.
  • The method mandates a "structure-of-arrays" (SoA) data layout. For each ODE component $k$, the vector of stage values

$(s_{1,k},\, s_{2,k},\, \ldots,\, s_{8,k})$

is held as a contiguous SIMD vector. This enables all eight stages of component $k$ to be updated in parallel.

Principal datatypes include Vec{8,Float64} for SIMD doubles, and VecArray{8,Float64,2} for the $8 \times d$ stage block (where $d$ is the system dimension). Arrays are stored in column-major order with the stage index as the fast index.
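A language-neutral way to picture the SoA layout is a flat buffer in which the eight stage values of each component occupy one contiguous, register-sized block. The Python sketch below uses a plain typed array standing in for Julia's Vec{8,Float64}, with an illustrative dimension d:

```python
from array import array

VEC_WIDTH = 8          # stages per SIMD register (AVX-512: 8 x 64-bit lanes)
d = 3                  # illustrative system dimension (hypothetical)

# Structure-of-arrays: the 8 stage values of each component are contiguous,
# so one aligned vector load fetches all stages of a component at once.
stages = array('d', [0.0] * (VEC_WIDTH * d))

def stage_slice(k):
    """The contiguous block holding stages s_{1,k}..s_{8,k} of component k."""
    return slice(k * VEC_WIDTH, (k + 1) * VEC_WIDTH)

# Stage i of component k lives at flat index k*VEC_WIDTH + i (stage index fast)
for k in range(d):
    for i in range(VEC_WIDTH):
        stages[k * VEC_WIDTH + i] = 10.0 * k + i
```

With the stage index fastest, each `stage_slice(k)` corresponds to exactly one aligned SIMD load/store in the Julia implementation.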

Parallel numerical evaluation and update stages employ SIMD-enabled elementwise arithmetic and vectorized constants, ensuring all floating-point operations exploit maximal hardware-wide parallelism.

The fixed-point update proceeds in sweeps: each sweep evaluates $f$ at all eight stage values and then refreshes every $s_i$ from the collocation equations, repeating until the stage values stabilize. All intermediate quantities, such as the stage values and the right-hand-side evaluations, are likewise held as SIMD vectors.
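A minimal, runnable rendition of this sweep in Python, scaled down to $s = 2$ with scalar arithmetic for clarity (the reference code is Julia operating on SIMD vectors), applied to a harmonic oscillator:

```python
import math

# Hardcoded 2-stage Gauss-Legendre tableau (order 4); the real method uses s = 8.
r3 = math.sqrt(3) / 6
A = [[0.25, 0.25 - r3], [0.25 + r3, 0.25]]
b = [0.5, 0.5]
s = 2

def f(y):
    """Harmonic oscillator: q' = p, p' = -q (Hamiltonian H = (q^2 + p^2)/2)."""
    q, p = y
    return (p, -q)

def step(y, h, sweeps=50):
    """One IRK step: fixed-point iteration on the collocation equations."""
    stages = [list(y) for _ in range(s)]        # initial guess: s_i = y_n
    for _ in range(sweeps):
        F = [f(si) for si in stages]            # evaluate all stages first...
        for i in range(s):                      # ...then refresh s_i = y_n + h*sum_j a_ij f(s_j)
            for k in range(2):
                stages[i][k] = y[k] + h * sum(A[i][j] * F[j][k] for j in range(s))
    F = [f(si) for si in stages]
    return tuple(y[k] + h * sum(b[j] * F[j][k] for j in range(s)) for k in range(2))

# Integrate past one period; energy should be conserved to near roundoff.
y, h = (1.0, 0.0), 0.1
for _ in range(int(2 * math.pi / h) + 1):
    y = step(y, h)
energy_drift = abs((y[0]**2 + y[1]**2) / 2 - 0.5)
```

Each sweep is a Jacobi-style update: all $f(s_j)$ are evaluated before any $s_i$ is refreshed, so the stage refreshes are mutually independent and map directly onto SIMD lanes in the vectorized code.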

3. Implementation Considerations

  • Programming Language: Julia 1.10.7, utilizing the SIMD.jl package and LLVM JIT for hardware vectorization (AVX2/AVX512/x86_64 or NEON/ARM64).
  • Memory Layout: Arrays are 64-byte aligned to maximize SIMD load/store efficiency.
  • Symplecticity: By encoding the coefficients as $a_{ij} = b_j\left(\tfrac{1}{2} + \mu_{ij}\right)$ and ensuring $\mu_{ij} = -\mu_{ji}$ at the kernel level, the symplectic structure is preserved up to IEEE-754 roundoff.
  • Iteration scheme: The deployed implementation uses a robust fixed-point iteration (rather than Newton) with an empirical energy drift stopping criterion.
  • Extensibility: A Newton-based solver could be substituted, with the stage Jacobian blocked for SIMD and reused between steps for additional efficiency (although not implemented in the reference code).

High efficiency is attained for sufficiently large system dimension $d$, due to amortization of vector reads/writes. For small $d$, efficiency gains are markedly smaller.

4. Benchmark Performance and Numerical Results

All results are for an 11th-gen Intel i7-11850H with AVX-512 (8×64-bit SIMD), running Julia with bounds checking disabled.

Schwarzschild Black Hole Geodesics

  • IRKGL16-SIMD attains the target maximum local Hamiltonian error in 0.08 s.
  • The reference explicit symplectic method (SS05, r=10) reaches the same error in 0.12 s.
  • At tighter error targets, the advantage of IRKGL16-SIMD over SS05 widens further.

Outer Solar System (6-body Problem)

  • For the target accuracy over the full integration span, IRKGL16-SIMD requires the least wall-clock time, ahead of the explicit RKN(8)-BM02 and RKN(6)-BCE22 splitting integrators.

Hénon–Heiles System

  • For the target RMS position error over the benchmark interval:
    • IRKGL16-SIMD: 0.15 s
    • Reference splitting integrators: 0.22 s and 0.30 s

Method          Order   Time [s] at target error   Speedup vs IRKGL16-SEQ
IRKGL16-SEQ     16      0.35                       —
IRKGL16-SIMD    16      0.08                       4.4×
SS05 (r=10)     10      0.12                       2.9×
S-BM02          6       0.14                       2.5×

These results highlight that IRKGL16-SIMD, although incurring higher per-step costs due to the implicit solve (8-stage fixed-point iterates), outpaces explicit integrators at stringent error budgets and permits much larger timesteps for a fixed error.

5. Practical Recommendations

  • Step size $h$: For optimal accuracy in double precision, choose $h$ to balance truncation error against accumulated roundoff.
  • System size: SIMD efficiency is best when the system dimension is large enough to keep the vector lanes filled.
  • Hardware notes:
    • AVX512 gives near-maximal speedups (using all 8 stages per SIMD register).
    • AVX2 achieves a smaller speedup (4-wide registers), with software adjustments.
    • ARM NEON functions analogously with Vec{4,Float32}.
  • Symplectic enforcement: Always parametrize via $a_{ij} = b_j\left(\tfrac{1}{2} + \mu_{ij}\right)$ and enforce $\mu_{ij} = -\mu_{ji}$ in floating point to guarantee algebraic symplecticity.
  • Trade-offs: At moderate accuracy requirements, explicit splitting schemes may retain an edge in raw speed. At high accuracy or over long integration times, IRKGL16-SIMD is preferable due to its stability and precision.
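The truncation/roundoff balance behind the step-size recommendation can be illustrated with a toy error model, $E(h) \approx C\,h^{16} + (T/h)\,\varepsilon$: truncation falls as $h^{16}$ while roundoff accumulates with the number of steps $T/h$. The constants $C$ and $T$ below are hypothetical placeholders, not fitted values:

```python
import math

EPS = 2.0**-52   # double-precision unit roundoff

def total_error(h, C=1.0, T=1.0):
    # Toy model: order-16 truncation term plus roundoff accumulated over T/h steps.
    return C * h**16 + (T / h) * EPS

# Scan step sizes; the minimum marks the truncation/roundoff balance point.
hs = [10**(-k / 10) for k in range(0, 21)]   # h from 1 down to 0.01
best_h = min(hs, key=total_error)
```

Because the order-16 truncation term falls so steeply, the optimum sits at a comparatively large $h$, which is part of why the method remains accurate with generous timesteps.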

6. Applicability, Limitations, and Context

IRKGL16-SIMD is applicable to generic Hamiltonian ODEs, notably when the system is non-stiff, high-dimensional, and high-precision results are required. Unlike splitting-based explicit methods, it does not require a partition of the Hamiltonian into solvable parts, thus broadening its domain of use.

Limitations include:

  • Inefficiency for low-dimensional systems or extremely stiff ODEs (due to slower fixed-point convergence).
  • Per-step overhead of implicit solves cannot be fully eliminated, but SIMD and fixed-point schemes mitigate this bottleneck.

A plausible implication is that further speedups may be achieved on multi-core or distributed memory architectures via stage-parallel methods and low-rank corrections, as demonstrated for Gauss–Legendre IRK in (Durastante et al., 23 May 2025). However, such approaches rely on additional matrix decompositions and parallelization strategies not present in the standard IRKGL16-SIMD implementation.

7. Summary and Outlook

IRKGL16-SIMD delivers a symplectic, high-order, implicit integration framework tailored for vectorized hardware. Its design ensures efficient, accurate, and structure-preserving integration of moderate-to-large-scale Hamiltonian ODEs in regimes where explicit methods either fail to reach required precision or suffer from excessive timestep constraints. The method is aligned with trends in computational hardware, exploiting SIMD at the algorithmic core to realize significant wall-time improvements. The precise balancing of high algebraic order, IEEE-exact symplecticity, and transparent SIMD support positions IRKGL16-SIMD as a preferred integrator for demanding, long-term ODE propagation tasks on modern CPUs (Antoñana et al., 5 Nov 2025).
