
Ginkgo Library: Sparse Linear Algebra & HPC

Updated 9 November 2025
  • Ginkgo is an open-source C++ framework for sparse numerical linear algebra that unifies linear operators, iterative solvers, and preconditioners.
  • Its architecture separates high-level algorithms from low-level device kernels via an Executor abstraction that manages resources on CPUs, NVIDIA and AMD GPUs, and SYCL devices.
  • Performance benchmarks show near-peak efficiency on diverse hardware, making Ginkgo ideal for extreme-scale scientific applications and heterogeneous computing.

Ginkgo is a modern open-source C++ library for sparse numerical linear algebra, architected to provide high performance, extensibility, and platform portability across CPUs and all major GPU architectures. Its core abstraction models all mathematical operations as "linear operators," enabling composability of matrix–vector products, iterative solvers, and preconditioners under a unified interface. Ginkgo is designed as a foundational building block for extreme-scale scientific simulation codes, allowing seamless deployment and acceleration on heterogeneous systems, with particular attention to software sustainability and ease of integration into complex application stacks.

1. Architectural Foundations and Core Abstractions

Ginkgo's architecture is based on a radical separation between high-level numerical algorithms and low-level, platform-specific kernels. The primary interface is the abstract class gko::LinOp, which models any linear map L : V → W, encompassing matrices, solvers, and preconditioners. Composition of operators, operator sums, and parametrized factories are first-class constructs, supporting combinatorial and recursive assembly of solver pipelines (Anzt et al., 2020).
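
As a minimal illustration of this unified interface (a sketch: the file names are placeholders, and the serial reference executor stands in for any backend), a matrix read from disk is applied through the same apply() signature that solvers and preconditioners expose:

    #include <ginkgo/ginkgo.hpp>
    #include <fstream>

    int main()
    {
        auto exec = gko::ReferenceExecutor::create();
        // Matrices, solvers, and preconditioners are all LinOps: each one
        // implements apply(b, x), i.e. x = L(b).
        auto A = gko::share(gko::read<gko::matrix::Csr<double>>(
            std::ifstream("A.mtx"), exec));
        auto b = gko::read<gko::matrix::Dense<double>>(
            std::ifstream("b.mtx"), exec);
        auto x = gko::clone(b);
        A->apply(b.get(), x.get());  // x = A * b: the matrix acting as a LinOp
    }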

All device interactions—memory allocation, kernel launches, data transfer—are encapsulated by the polymorphic gko::Executor class hierarchy. Executors abstract resource management for specific backends: OpenMP and serial CPUs (OmpExecutor, ReferenceExecutor), NVIDIA GPUs (CudaExecutor), AMD GPUs (HipExecutor), and Intel/SYCL devices (DpcppExecutor) (Cojean et al., 2020). This interface ensures that end-user and application code is insulated from device-specific details, eliminating the need for device annotations or preprocessor branches.
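
For illustration, a minimal sketch of runtime backend selection (the string-based dispatch, device index 0, and host master executor are illustrative choices layered over the create() calls, not a Ginkgo API):

    #include <ginkgo/ginkgo.hpp>
    #include <memory>
    #include <string>

    // Pick a backend by name at runtime; every later allocation and kernel
    // launch is bound to the device owned by the returned executor.
    std::shared_ptr<gko::Executor> make_executor(const std::string& name)
    {
        auto host = gko::OmpExecutor::create();
        if (name == "cuda") {
            return gko::CudaExecutor::create(0, host);  // device 0, host master
        } else if (name == "hip") {
            return gko::HipExecutor::create(0, host);
        } else if (name == "dpcpp") {
            return gko::DpcppExecutor::create(0, host);
        } else if (name == "omp") {
            return host;
        }
        return gko::ReferenceExecutor::create();  // serial reference backend
    }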

The library’s modular structure consists of:

  • Core: LinOp type hierarchy, solvers (CG, GMRES, BiCGStab, etc.), preconditioners (Jacobi, ILU, SPAI), stopping criteria, and memory abstractions.
  • Backends: Performance-critical kernels tailored for each device, behind the Executor interface. These include hand-optimized SpMV, vector, reduction, and factorization routines, with separation of “common” device code to minimize duplication between CUDA and HIP/ROCm (Tsai et al., 2020).

Smart-pointer ownership (via the share, lend, and give decorators) provides deterministic resource management consistent with modern C++ idioms. Zero-copy array_view constructs allow seamless data exchange between host application and device memory spaces.
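
A short sketch of this zero-copy exchange, assuming host-resident application data and a reference executor (the buffer size and contents are arbitrary):

    #include <ginkgo/ginkgo.hpp>
    #include <vector>

    int main()
    {
        auto exec = gko::ReferenceExecutor::create();
        std::vector<double> host_data(100, 1.0);  // application-owned buffer
        // Wrap the buffer as a Ginkgo array without copying it ...
        auto view = gko::make_array_view(exec, host_data.size(),
                                         host_data.data());
        // ... and expose it as a 100x1 dense vector (stride 1) that shares
        // the application's storage.
        auto vec = gko::matrix::Dense<double>::create(
            exec, gko::dim<2>{host_data.size(), 1}, std::move(view), 1);
    }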

2. Supported Platforms, Backends, and Portability

Ginkgo delivers both software and performance portability across computing architectures:

  • CPU: Multi-threaded OpenMP backend and reference serial executor.
  • NVIDIA GPU: Native CUDA backend, using CUDA kernel-launch and stream-management features directly.
  • AMD GPU: HIP backend, with direct ROCm/HIP kernel launches and AMD-specific optimizations; CUDA–HIP codebase harmonization is achieved via "common" shared device kernels and architecture-sensitive compile-time parameters (Tsai et al., 2020).
  • Intel GPU/SYCL: DpcppExecutor invokes the SYCL programming model, supporting deployment on Intel GPUs and other SYCL-conformant devices (Cojean et al., 2020, Nguyen et al., 2023).

At compile time, users select which backends to build. At runtime, an appropriate Executor is instantiated (e.g., gko::CudaExecutor::create, gko::OmpExecutor::create), and all subsequent allocations and operations are bound to the device associated with that executor, enabling flexible, backend-agnostic application logic. No preprocessor-level selection or device qualifiers are visible in solver code.
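
For instance (a sketch; the file name is a placeholder), moving application data onto the device owned by an executor is a one-line clone, after which every operation on the result runs there:

    #include <ginkgo/ginkgo.hpp>
    #include <fstream>

    int main()
    {
        // Read the matrix on the host, then clone it onto a GPU executor;
        // subsequent apply() calls on A_gpu run on that device.
        auto host_exec = gko::ReferenceExecutor::create();
        auto gpu_exec = gko::CudaExecutor::create(0, host_exec);
        auto A_host = gko::read<gko::matrix::Csr<double>>(
            std::ifstream("A.mtx"), host_exec);
        auto A_gpu = gko::clone(gpu_exec, A_host);
    }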

Portability is realized at two levels: (i) the code runs unmodified on multiple architectures, and (ii) each backend achieves near-peak resource utilization, typically 70–90% of platform roofline bounds in SpMV and Krylov methods (Cojean et al., 2020).

3. Sparse Linear Algebra Functionality and Programming Interface

Core sparse kernels are provided for CSR, COO, ELL, and hybrid formats. All matrix, solver, and preconditioner types adhere to the LinOp interface, supporting:

  • Matrix–vector operations (e.g., SpMV): Both y = Ax and y = αAx + βy, with kernel specializations for performance (Anzt et al., 2020).
  • Iterative solvers: CG, BiCG, BiCGStab, CGS, GMRES (with restart), FCG; each is created via factory patterns and accepts customizable stopping criteria, e.g.:
    // Build a CG solver factory with iteration-count and residual-norm
    // stopping criteria and an ILU preconditioner, bind it to an executor,
    // and generate a solver instance for the system matrix A.
    auto solver = gko::solver::Cg<>::build()
        .with_criteria(
            gko::stop::Iteration::build().with_max_iters(1000u),
            gko::stop::ResidualNorm<>::build().with_reduction_factor(1e-8)
        )
        .with_preconditioner(gko::preconditioner::Ilu<>::build())
        .on(exec)        // returns the executor-bound factory
        ->generate(A);   // returns the solver, itself a LinOp
    solver->apply(b.get(), x.get());  // solve A x = b
  • Preconditioners: Block-Jacobi (with block-adaptive precision), ILU(0), ParILU, ParILUT, and various SPAI variants, all LinOpFactories.
  • Batched solvers: The BatchCsr, BatchEll, and Batched CG/GMRES solvers permit simultaneous solution of large ensembles of small linear systems, optimized for single-SYCL-kernel execution (Nguyen et al., 2023).
  • Extensibility: New matrix formats, solvers, and preconditioners can be added via template/mixin mechanisms, with minimal code duplication.
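
As an illustration of the mixin mechanism (a hedged sketch modeled on Ginkgo's custom-operator examples: DiagScale is a made-up operator, and the extended apply overload is stubbed out with GKO_NOT_IMPLEMENTED):

    #include <ginkgo/ginkgo.hpp>
    #include <memory>

    // A user-defined diagonal-scaling operator built with the EnableLinOp
    // mixin; EnableCreateMethod supplies the usual create() factory.
    class DiagScale : public gko::EnableLinOp<DiagScale>,
                      public gko::EnableCreateMethod<DiagScale> {
    public:
        explicit DiagScale(std::shared_ptr<const gko::Executor> exec,
                           gko::size_type n = 0, double scale = 1.0)
            : gko::EnableLinOp<DiagScale>(exec, gko::dim<2>{n, n}),
              scale_{scale}
        {}

    protected:
        using Dense = gko::matrix::Dense<double>;

        // x = scale * b, expressed with existing Dense kernels so that the
        // operator runs on any executor without device-specific code.
        void apply_impl(const gko::LinOp* b, gko::LinOp* x) const override
        {
            auto alpha = gko::initialize<Dense>({scale_}, this->get_executor());
            auto dense_x = gko::as<Dense>(x);
            dense_x->copy_from(gko::as<Dense>(b));
            dense_x->scale(alpha.get());
        }

        // Extended apply x = alpha * op(b) + beta * x, omitted in this sketch.
        void apply_impl(const gko::LinOp*, const gko::LinOp*,
                        const gko::LinOp*, gko::LinOp*) const override
            GKO_NOT_IMPLEMENTED;

    private:
        double scale_;
    };

Once defined, such an operator composes with solvers and preconditioners exactly like any built-in LinOp.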

Numerical operations exploit advanced device features, including subwarp/subwavefront cooperative groups (with Ginkgo’s custom abstraction bridging CUDA/HIP differences), shared local memory (SLM), and device-level reductions. Device-specific details are confined to backend modules.

4. Performance Evaluation and Portability Benchmarks

Comprehensive benchmarks demonstrate Ginkgo’s backend kernels are highly competitive or superior relative to vendor libraries, often approaching architectural roofline limits:

  • SpMV (SuiteSparse matrices, double precision): On NVIDIA V100, Ginkgo achieves ~135 GFLOP/s; on A100, ~220 GFLOP/s (cf. 920 GB/s and 1400 GB/s STREAM bandwidth ceilings); on AMD MI100, ~138 GFLOP/s (80% of bound) (Cojean et al., 2020).
  • Krylov solvers (CG, BiCGSTAB, GMRES): V100 median throughput ~65 GFLOP/s; A100 up to 140 GFLOP/s; MI100 ~65 GFLOP/s; Intel Gen.9, 1.4–2.0 GFLOP/s (within 50% of theoretical bound for integrated GPU) (Cojean et al., 2020).
  • Batched solvers (SYCL/Intel PVC): On production combustion workloads, SYCL batched solvers on the Intel Max 1550 GPU achieve 2.4× lower time-to-solution than CUDA on NVIDIA H100 for batch sizes up to 2¹⁷ (Nguyen et al., 2023). SLM-centric memory layouts and kernel-fusion strategies produce linear scaling in batch and matrix size, with up to 65% SLM utilization.
  • Runtime overhead: Multiple dynamic dispatches in solver loops contribute negligible cost (~1.3 μs/iteration on “empty” test problems).

Performance portability is confirmed by empirical close-to-roofline efficiency, with careful device-specific code keeping the gap between native and cross-compiled kernels small (e.g., a 3–10% performance penalty when running HIP-compiled code) (Tsai et al., 2020).
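
To make the roofline comparison concrete, a back-of-envelope estimate (an illustration, not a figure from the cited papers): double-precision CSR SpMV performs 2 flops per stored nonzero while reading roughly 12 bytes for it (an 8-byte value plus a 4-byte column index, ignoring vector and row-pointer traffic), so memory bandwidth B bounds throughput at

    P_max ≈ (2 flops / 12 bytes) × B = B / (6 bytes per flop).

For the V100's ~920 GB/s this gives P_max ≈ 153 GFLOP/s, so the ~135 GFLOP/s measured above corresponds to roughly 88% of the bound, consistent with the 70–90% roofline efficiency cited earlier.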

5. Integration Patterns and Scientific Application Deployment

Ginkgo is designed for flexible integration into existing simulation frameworks. There are two principal approaches:

  • Loose coupling: Major simulation codes (e.g., HiOP, openCARP, PeleLM, OpenFOAM via OGL) define abstract solver interfaces and implement backends using both PETSc/SUNDIALS and Ginkgo. Selection is performed at configure or run time; data may undergo a single layout copy at construction, with all runtime computation using zero-copy array_views (Koch et al., 19 Sep 2025). Adapter sizes are small: HiOP (~470 LOC for CG/LU), openCARP (~300–500 LOC per layer), SUNDIALS (~3,500 LOC for the Ginkgo adapter). A schematic sketch of this pattern follows the list.
  • Tight coupling: Standalone plugin libraries (e.g., OGL for OpenFOAM, ~7,000 LOC) directly manipulate Ginkgo types for maximum runtime efficiency, at the expense of deeper dependency on Ginkgo’s API and object model.
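
A schematic of the loose-coupling pattern (LinearSolver and GinkgoCgSolver are hypothetical names, not taken from HiOP or openCARP; the criteria syntax assumes a recent Ginkgo release with deferred factory parameters):

    #include <ginkgo/ginkgo.hpp>
    #include <memory>

    // Hypothetical application-side solver interface; the Ginkgo adapter is
    // just one of several interchangeable backends.
    struct LinearSolver {
        virtual ~LinearSolver() = default;
        // Solve A x = b, with b and x living in application-owned memory.
        virtual void solve(const double* b, double* x, std::size_t n) = 0;
    };

    class GinkgoCgSolver : public LinearSolver {
    public:
        GinkgoCgSolver(std::shared_ptr<gko::Executor> exec,
                       std::shared_ptr<gko::matrix::Csr<double>> A)
            : exec_{exec},
              solver_{gko::solver::Cg<>::build()
                          .with_criteria(gko::stop::Iteration::build()
                                             .with_max_iters(1000u))
                          .on(exec)
                          ->generate(A)}
        {}

        void solve(const double* b, double* x, std::size_t n) override
        {
            // Zero-copy views over application memory; only the initial
            // matrix assembly incurred a layout copy. (Production code would
            // use Ginkgo's const views instead of const_cast.)
            auto dense_b = gko::matrix::Dense<double>::create(
                exec_, gko::dim<2>{n, 1},
                gko::make_array_view(exec_, n, const_cast<double*>(b)), 1);
            auto dense_x = gko::matrix::Dense<double>::create(
                exec_, gko::dim<2>{n, 1}, gko::make_array_view(exec_, n, x), 1);
            solver_->apply(dense_b.get(), dense_x.get());
        }

    private:
        std::shared_ptr<gko::Executor> exec_;
        std::unique_ptr<gko::LinOp> solver_;
    };

The adapter owns the executor and the generated solver, so the host application never touches a Ginkgo type directly.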

Integration with complex frameworks (MFEM, deal.ii) is supported via lightweight wrappers, and all type operations (allocation, application, destruction) are handled via the Executor mechanism without embedded hardware-specific logic (Anzt et al., 2020). Python integration is provided by pyGinkgo—a pybind11-powered wrapper that presents nearly 1:1 bindings to Ginkgo, supporting NumPy/PyTorch zero-copy data exchange and outperforming all Python-native sparse linear algebra frameworks on both CPU and GPU (Tuteja et al., 9 Oct 2025).

6. Software Sustainability, Extensibility, and Future Directions

Software sustainability is central to Ginkgo’s design:

  • Resource management: RAII and smart pointers ensure deterministic resource and memory management (no manual free/new).
  • Regression safety: All backends execute identical unit, integration, and CI test suites on CPU and all GPU targets.
  • Code maintainability: Device-specific kernels are kept as small, focused modules; a "common" device code layer handles kernel logic shared between CUDA and HIP (Tsai et al., 2020). Template mixins (EnableLinOp, etc.) and modern C++14 facilitate boilerplate reduction and extensibility (Anzt et al., 2020).
  • Documentation and release process: Ginkgo follows a clear, versioned release model. APIs expose raw pointers and array views, enabling downstream codes to pin to stable versions or increment at minor release levels.
  • Platform readiness: Recent work has established strong SYCL support on Intel GPUs; ongoing development addresses variability in device math libraries and extends complex-number support (Nguyen et al., 2023).

Extending to further architectures or new solver algorithms involves subclassing Executor, implementing the corresponding device kernels, and registering them via operation/factory interfaces, with minimal disruption to core algorithms; a sketch of the dispatch mechanism follows.
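
A hedged sketch of the operation-dispatch mechanism behind this extension path (HelloOp is illustrative; production kernels are typically registered per backend via GKO_REGISTER_OPERATION):

    #include <ginkgo/ginkgo.hpp>
    #include <iostream>

    // An Operation carries one run() overload per executor type;
    // exec->run(op) dispatches to the overload matching the backend.
    class HelloOp : public gko::Operation {
    public:
        void run(std::shared_ptr<const gko::OmpExecutor>) const override
        {
            std::cout << "OpenMP backend\n";
        }
        void run(std::shared_ptr<const gko::CudaExecutor>) const override
        {
            std::cout << "CUDA backend\n";  // would launch a device kernel
        }
        // Overloads for the remaining backends are omitted in this sketch.
    };

    int main()
    {
        auto exec = gko::OmpExecutor::create();
        exec->run(HelloOp{});  // prints "OpenMP backend"
    }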

7. Representative Applications and Scientific Impact

Ginkgo underlies a range of production simulation workflows:

  • Low-Mach number combustion (PeleLM): Batched GMRES/CG solvers for independent small linear systems, achieving 2–3× acceleration by offloading to GPU (Koch et al., 19 Sep 2025, Nguyen et al., 2023).
  • Power grid optimization (ExaSGD/HiOP): Replacement of CPU-only solvers (MA57) with Ginkgo GPU-resident direct/LU solvers produced significant end-to-end performance improvements for ACOPF analysis.
  • Cardiac electrophysiology (openCARP): A loose-coupling adapter stack enables use of Ginkgo or PETSc for linear solves.
  • CFD (OpenFOAM via OGL): Native Ginkgo plugins allow OpenFOAM codes to leverage GPU kernels without refactoring solver logic.
  • pyGinkgo sparse machine learning models: PyTorch models can embed arbitrary sparse weight matrices as Ginkgo LinOps, resulting in superior SpMV throughput and reducing end-to-end training and inference times by factors of 2–14 over native Python libraries (Tuteja et al., 9 Oct 2025).

These cases demonstrate that Ginkgo’s blend of abstraction, backend modularity, and API transparency facilitates practical, maintainable adoption in extreme-scale and high-performance settings, establishing Ginkgo as a central infrastructure element for next-generation, performance-portable scientific computing.
