Morpheus DynamicMatrix Overview

Updated 20 February 2026

Morpheus DynamicMatrix is a unified abstraction that can represent a sparse matrix in any of several storage formats (e.g., CSR, ELL, HYB) and switch between them at runtime to speed up computations.
It provides a consistent API and supports both manual and auto-tuned runtime format selection to maximize performance on heterogeneous systems.
Benchmarks with HPCG show SpMV speedups of up to 7× on GPUs and 2.5× on CPUs.

Morpheus DynamicMatrix is a unified abstraction for dynamic sparse matrices, designed to provide high productivity and performance portability for sparse linear algebra operations on heterogeneous platforms. At its core, a DynamicMatrix is a C++ container that can represent a sparse matrix in one of several standard formats, principally CSR (Compressed Sparse Row), ELL (ELLPACK), and HYB (a hybrid of ELL and COO), and switch among them at runtime. This capability is central to the Morpheus library's strategy of separating the user-facing interface from platform- and problem-specific optimizations, so that end-users benefit from runtime format selection without needing deep knowledge of individual sparse matrix storage formats (Stylianou et al., 2022).

1. Definition and Data Layouts

A DynamicMatrix<T, Index, ExecSpace, MemSpace> encapsulates the following:

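Instances of the supported formats (CSR, ELL, and HYB, i.e., ELL+COO), together with an internal enum indicating the "active" representation. The benchmarks in Section 4 additionally use the COO (coordinate) and DIA (diagonal) formats.
An external interface matching the APIs of the underlying formats (e.g., SpMV, element-wise update, conversion).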

For an $M$-row matrix with $\mathrm{nnz}$ stored nonzeros, the data layouts are as follows:

Format	Storage arrays	Row organization
CSR	val $[0\,..\,\mathrm{nnz}-1]$, col $[0\,..\,\mathrm{nnz}-1]$, row_ptr $[0\,..\,M]$	Nonzeros of row $i$ occupy positions $[\mathrm{row\_ptr}[i],\,\mathrm{row\_ptr}[i+1])$
ELL	val_ell $[i][k]$, col_ell $[i][k]$	Each row padded to $K_{\max} = \max_{i} \mathrm{nnz\_in\_row}(i)$ entries
HYB	ELL part $\{\text{val\_ell}, \text{col\_ell}\}$ plus COO part $\{\text{row\_coo}, \text{col\_coo}, \text{val\_coo}\}$	ELL holds up to $K \le K_{\max}$ nonzeros per row, the remainder goes to COO
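The layouts in the table can be made concrete with a small, host-only sketch. The structs Csr, Coo, Ell, Hyb and the functions csr_to_hyb and max_row_length are hypothetical illustrations, not Morpheus types, and marking padded ELL slots with column index -1 is an assumed convention. The function splits each CSR row so that its first K nonzeros land in the ELL arrays and any remainder overflows into COO; choosing K equal to the longest row gives a pure ELL matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR storage: row_ptr has rows + 1 entries, col/val have nnz entries.
struct Csr {
  int rows = 0, cols = 0;
  std::vector<int> row_ptr;
  std::vector<int> col;
  std::vector<double> val;
};

// Hypothetical COO storage: one (row, col, val) triple per nonzero.
struct Coo {
  std::vector<int> row, col;
  std::vector<double> val;
};

// Hypothetical ELL storage: rows x K arrays, row-major; padding uses col = -1, val = 0.
struct Ell {
  int rows = 0, K = 0;
  std::vector<int> col;
  std::vector<double> val;
};

// HYB = ELL part for the first K entries of each row + COO part for the rest.
struct Hyb {
  Ell ell;
  Coo coo;
};

// Longest row, i.e. the ELL width needed so that no entry overflows into COO.
int max_row_length(const Csr& A) {
  int m = 0;
  for (int i = 0; i < A.rows; ++i) m = std::max(m, A.row_ptr[i + 1] - A.row_ptr[i]);
  return m;
}

// Split CSR into HYB with ELL width K.
Hyb csr_to_hyb(const Csr& A, int K) {
  Hyb H;
  H.ell.rows = A.rows;
  H.ell.K = K;
  const std::size_t slots = static_cast<std::size_t>(A.rows) * K;
  H.ell.col.assign(slots, -1);  // every slot starts as padding
  H.ell.val.assign(slots, 0.0);
  for (int i = 0; i < A.rows; ++i) {
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
      const int k = p - A.row_ptr[i];  // position of this nonzero within row i
      if (k < K) {
        H.ell.col[static_cast<std::size_t>(i) * K + k] = A.col[p];
        H.ell.val[static_cast<std::size_t>(i) * K + k] = A.val[p];
      } else {  // row is longer than K: overflow goes to COO
        H.coo.row.push_back(i);
        H.coo.col.push_back(A.col[p]);
        H.coo.val.push_back(A.val[p]);
      }
    }
  }
  return H;
}
```

For example, a 3×3 matrix whose rows hold 1, 3 and 1 nonzeros, split with K = 1, keeps one entry per row in ELL and moves the two extra entries of the middle row to COO. A smaller K trades ELL padding for COO entries, which is why HYB suits matrices where only a few rows are unusually long.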

Any supported format can be activated at runtime, with the stored data converted to that format by the library's conversion routines. For example:

DynamicMatrix<double,int,Cuda,Device> A;
CsrMatrix<double,int,...> A_csr = ...;
A = A_csr;              // A.active() == Format::CSR
A.activate(Format::ELL); // Converts in place to ELL

The data structure ensures all API calls route through the active format, allowing users to program against a uniform interface.

2. Internal Architecture and API

DynamicMatrix is implemented following the State and Visitor design patterns. Internally, it holds one instance of each supported sparse matrix format and a state variable designating the active format. The API provides:

Construction from any supported concrete format (CSR, ELL, HYB).
Format query (active()) and explicit activation (activate(Format f)).
Unified high-level algorithms including matrix-vector multiply (multiply()), elementwise and structural conversions (copy_from(), convert_from()).
Templated interfaces for vector and backend abstractions.
Dispatch to platform-specific or optimized kernels determined dynamically according to the active storage format.
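The State/Visitor structure described above can be sketched with std::variant: the variant's active index plays the role of the state variable, and std::visit dispatches each call to the kernel of the active format. This is a minimal, host-only sketch under stated assumptions: DynamicMat, CooMat, CsrMat, Format, spmv, to_coo and coo_to_csr are hypothetical names, not the Morpheus API; only two formats are shown; and Kokkos execution and memory spaces are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical format tags; the enum order must match the variant order below.
enum class Format { COO, CSR };

struct CooMat {
  int rows = 0;
  std::vector<int> row, col;
  std::vector<double> val;
};

struct CsrMat {
  int rows = 0;
  std::vector<int> row_ptr, col;
  std::vector<double> val;
};

// Format-specific SpMV kernels computing y = A * x.
void spmv(const CooMat& A, const std::vector<double>& x, std::vector<double>& y) {
  y.assign(A.rows, 0.0);
  for (std::size_t n = 0; n < A.val.size(); ++n) y[A.row[n]] += A.val[n] * x[A.col[n]];
}

void spmv(const CsrMat& A, const std::vector<double>& x, std::vector<double>& y) {
  y.assign(A.rows, 0.0);
  for (int i = 0; i < A.rows; ++i)
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) y[i] += A.val[p] * x[A.col[p]];
}

// Every format knows how to reach COO; COO knows how to reach every format.
CooMat to_coo(const CooMat& A) { return A; }
CooMat to_coo(const CsrMat& A) {
  CooMat C{A.rows, {}, A.col, A.val};
  for (int i = 0; i < A.rows; ++i)
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) C.row.push_back(i);
  return C;
}

// Assumes COO entries are sorted by row (true for COO produced by to_coo above).
CsrMat coo_to_csr(const CooMat& C) {
  CsrMat A{C.rows, std::vector<int>(C.rows + 1, 0), C.col, C.val};
  for (int r : C.row) ++A.row_ptr[r + 1];                              // count per row
  for (int i = 0; i < C.rows; ++i) A.row_ptr[i + 1] += A.row_ptr[i];  // prefix sum
  return A;
}

class DynamicMat {
 public:
  explicit DynamicMat(CsrMat A) : m_(std::move(A)) {}
  explicit DynamicMat(CooMat A) : m_(std::move(A)) {}

  // State query: which alternative of the variant is currently held.
  Format active() const { return static_cast<Format>(m_.index()); }

  // State change: convert through COO, then build the requested format.
  void activate(Format f) {
    if (f == active()) return;
    CooMat c = std::visit([](const auto& A) { return to_coo(A); }, m_);
    if (f == Format::COO) {
      m_ = std::move(c);
    } else {
      m_ = coo_to_csr(c);
    }
  }

  // Visitor: dispatch to the SpMV kernel of whichever format is active.
  void multiply(const std::vector<double>& x, std::vector<double>& y) const {
    std::visit([&](const auto& A) { spmv(A, x, y); }, m_);
  }

 private:
  std::variant<CooMat, CsrMat> m_;
};
```

Unlike the description above, which keeps one instance of each format, this sketch stores only the active representation, so activation replaces the data instead of switching between preallocated containers. Either way, the caller's multiply() call is the same regardless of the active format.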

Morpheus leverages Kokkos abstractions for execution space and memory space, ensuring compatibility across CPUs and GPUs. All functions maintain a consistent signature irrespective of the backend or storage format, which enables source code portability.

3. Dynamic Runtime Format Selection

Morpheus DynamicMatrix exposes two runtime format selection mechanisms:

Manual selection: The user can invoke A.activate(Format::ELL) or similar, or specify the format via configuration at runtime.
Automatic (auto-tuning): The library measures the runtime $T_{\text{run}}(f)$ of candidate formats $f \in F$ for operations such as SpMV, then selects $f^* = \arg\min_{f \in F} T_{\text{run}}(f)$ for subsequent computations. In distributed contexts (e.g., MPI), selection can be performed per process and separately for the local and ghost submatrices, minimizing $T_{\text{loc}}(f_\text{loc}) + T_{\text{ghost}}(f_\text{ghost})$.
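The selection rule $f^* = \arg\min_{f \in F} T_{\text{run}}(f)$ amounts to timing each candidate and keeping the fastest. A minimal sketch, assuming each candidate format is wrapped in a callable that performs one SpMV; Candidate, time_run, select_format and select_local_ghost are hypothetical helpers, not Morpheus functions. Because $T_{\text{loc}}(f_\text{loc}) + T_{\text{ghost}}(f_\text{ghost})$ is a sum of two terms with independent arguments, the distributed case reduces to two independent minimizations.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical candidate: a format name plus a callable running one SpMV in that format.
struct Candidate {
  std::string name;
  std::function<void()> run_spmv;
};

// One possible T_run estimator: mean wall-clock time of `reps` calls after a warm-up.
double time_run(const Candidate& c, int reps = 10) {
  c.run_spmv();  // warm-up call, excluded from timing
  const auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) c.run_spmv();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count() / reps;
}

// Returns the index of f* = argmin_f cost(f). The cost function is injectable.
std::size_t select_format(const std::vector<Candidate>& F,
                          const std::function<double(const Candidate&)>& cost) {
  std::size_t best = 0;
  double best_t = std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < F.size(); ++i) {
    const double t = cost(F[i]);
    if (t < best_t) {
      best_t = t;
      best = i;
    }
  }
  return best;
}

// T_loc(f_loc) + T_ghost(f_ghost) is separable: minimize each term on its own.
std::pair<std::size_t, std::size_t> select_local_ghost(
    const std::vector<Candidate>& F_loc, const std::vector<Candidate>& F_ghost,
    const std::function<double(const Candidate&)>& cost) {
  return {select_format(F_loc, cost), select_format(F_ghost, cost)};
}
```

The cost function is a parameter so the selection logic can be exercised with mocked timings; in practice one would pass time_run. Averaging several repetitions after a warm-up call reduces timing noise, but a production auto-tuner may well use a different estimator.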

Conversions between formats use COO as an intermediate (proxy) format. The conversion cost $T_{\text{conv}}$ (e.g., CSR $\rightarrow$ ELL) scales with the number of nonzeros, $T_{\text{conv}} \approx \mathrm{nnz} \cdot c_{\text{parse}}$, where $c_{\text{parse}}$ is the per-entry processing cost, and is empirically negligible compared with the accumulated $T_{\text{run}}$ of the many SpMV calls over which it is amortized. Section IV-A of the paper shows that running HPCG through DynamicMatrix with CSR active incurs at most a 5% overhead and frequently yields a minor speedup (Stylianou et al., 2022).
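The COO-as-proxy route can be illustrated by converting CSR to DIA, the format behind most of the speedups reported in Section 4: CSR is first expanded to COO by replicating each row index, then every COO entry is scattered to the diagonal (col − row) it lies on. This is a hypothetical host-only sketch; CooMat, CsrMat, DiaMat, csr_to_coo, coo_to_dia, csr_to_dia and dia_spmv are illustrative names, not the Morpheus API. Apart from a lookup over the set of distinct diagonals and zero-filling the DIA array, each step is a single pass over the nonzeros, in line with $T_{\text{conv}} \approx \mathrm{nnz} \cdot c_{\text{parse}}$.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct CooMat { int rows = 0, cols = 0; std::vector<int> row, col; std::vector<double> val; };
struct CsrMat { int rows = 0, cols = 0; std::vector<int> row_ptr, col; std::vector<double> val; };

// Hypothetical DIA storage: one zero-padded column of length `rows` per occupied diagonal.
struct DiaMat {
  int rows = 0, cols = 0;
  std::vector<int> offsets;  // sorted diagonal offsets (col - row)
  std::vector<double> data;  // data[d * rows + i] = A(i, i + offsets[d])
};

// Step 1: CSR -> COO, one pass over the nonzeros.
CooMat csr_to_coo(const CsrMat& A) {
  CooMat C{A.rows, A.cols, {}, A.col, A.val};
  C.row.reserve(A.val.size());
  for (int i = 0; i < A.rows; ++i)
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) C.row.push_back(i);
  return C;
}

// Step 2: COO -> DIA. Find the distinct offsets, then scatter each entry.
DiaMat coo_to_dia(const CooMat& C) {
  std::map<int, int> slot;  // offset -> diagonal index; std::map keeps offsets sorted
  for (std::size_t n = 0; n < C.val.size(); ++n) slot.emplace(C.col[n] - C.row[n], 0);
  DiaMat D{C.rows, C.cols, {}, {}};
  for (auto& [off, idx] : slot) {
    idx = static_cast<int>(D.offsets.size());
    D.offsets.push_back(off);
  }
  D.data.assign(D.offsets.size() * C.rows, 0.0);
  for (std::size_t n = 0; n < C.val.size(); ++n) {
    const int d = slot.at(C.col[n] - C.row[n]);
    // += so that duplicate COO entries are summed rather than overwritten
    D.data[static_cast<std::size_t>(d) * C.rows + C.row[n]] += C.val[n];
  }
  return D;
}

DiaMat csr_to_dia(const CsrMat& A) { return coo_to_dia(csr_to_coo(A)); }

// y = A * x in DIA: no column indices are loaded; x is read at i + offset.
std::vector<double> dia_spmv(const DiaMat& D, const std::vector<double>& x) {
  std::vector<double> y(D.rows, 0.0);
  for (std::size_t d = 0; d < D.offsets.size(); ++d)
    for (int i = 0; i < D.rows; ++i) {
      const int j = i + D.offsets[d];
      if (j >= 0 && j < D.cols) y[i] += D.data[d * D.rows + i] * x[j];
    }
  return y;
}
```

The DIA kernel reads no column indices, which is one reason it can outperform CSR on banded stencil matrices such as HPCG's. Since the conversion is a one-off cost, switching from a format $f_0$ to $f^*$ pays for itself after roughly $T_{\text{conv}} / (T_{\text{run}}(f_0) - T_{\text{run}}(f^*))$ SpMV calls.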

4. Performance Portability Benchmarks

Benchmarks were run on ARCHER2 (dual AMD EPYC 7742 CPUs, 128 cores per node, OpenMP backend) and on Cirrus GPU nodes (dual Intel Xeon Gold 6248 with 4× NVIDIA Tesla V100, CUDA backend). Key results include:

On single-node CPUs (ARCHER2), switching from CSR to DIA format delivered up to $3.5\times$ speedup for large 27-point-stencil HPCG matrices.
On Cirrus GPUs, DIA outperformed CSR by up to $4.5\times$.
In strong scaling regimes (512×512×256 problem, ARCHER2), DynamicMatrix achieved up to $2.5\times$ speedup over CSR-only HPCG.
On a single Cirrus GPU (384×256×128 problem), DIA gave a $6.5\times$ speedup, dropping to $1.3\times$ on 8 GPUs as communication came to dominate.
In weak scaling, using DIA for the local submatrix and COO for the ghost submatrix yielded $1.5\times$ (ARCHER2) and $4\times$ (Cirrus GPUs) improvements.

Overall, DynamicMatrix's runtime format selection yielded $2.5\times$ (CPU) and $7\times$ (GPU) improvement in the SpMV kernel of HPCG, all achieved with no code modifications beyond the Morpheus port (Stylianou et al., 2022).

5. Porting High Performance Conjugate Gradient (HPCG) to Morpheus

The HPCG benchmark was ported to Morpheus DynamicMatrix through three main steps:

Vector replacement: Substitution of user vectors with Morpheus DenseVector<T, ...>, aliasing existing memory buffers, and direct replacement of the dot and WAXPBY operations.
Matrix replacement: Replacement of HPCG's SparseMatrix (a CSR-like layout built from per-row pointer arrays) with Morpheus CsrMatrix, using an elementwise conversion at setup. SpMV is replaced by Morpheus::multiply.
Dynamic conversion: Conversion of CsrMatrix to DynamicMatrix at setup, activation of the runtime-selected format, and use of Morpheus::multiply.
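The matrix-replacement step amounts to flattening HPCG's per-row arrays into three contiguous CSR arrays once, at setup. A minimal sketch, assuming a simplified stand-in HpcgSparseMatrix whose field names follow HPCG's SparseMatrix struct (localNumberOfRows, localNumberOfColumns, nonzerosInRow, mtxIndL, matrixValues; check them against the HPCG sources) and a hypothetical CsrMat target rather than the Morpheus CsrMatrix:

```cpp
#include <cassert>
#include <vector>

using local_int_t = int;  // assumed: HPCG's default local index type

// Simplified stand-in for the HPCG SparseMatrix fields used here.
struct HpcgSparseMatrix {
  local_int_t localNumberOfRows;
  local_int_t localNumberOfColumns;
  char* nonzerosInRow;     // nonzero count of each row
  local_int_t** mtxIndL;   // mtxIndL[i][j]: local column index of j-th entry of row i
  double** matrixValues;   // matrixValues[i][j]: value of that entry
};

// Hypothetical contiguous CSR target.
struct CsrMat {
  int rows = 0, cols = 0;
  std::vector<int> row_ptr, col;
  std::vector<double> val;
};

// Elementwise copy at setup: per-row pointer arrays -> contiguous CSR arrays.
CsrMat hpcg_to_csr(const HpcgSparseMatrix& A) {
  CsrMat C;
  C.rows = A.localNumberOfRows;
  C.cols = A.localNumberOfColumns;
  C.row_ptr.assign(C.rows + 1, 0);
  for (local_int_t i = 0; i < C.rows; ++i)
    C.row_ptr[i + 1] = C.row_ptr[i] + A.nonzerosInRow[i];  // prefix sum of row counts
  C.col.reserve(C.row_ptr[C.rows]);
  C.val.reserve(C.row_ptr[C.rows]);
  for (local_int_t i = 0; i < C.rows; ++i)
    for (int j = 0; j < A.nonzerosInRow[i]; ++j) {
      C.col.push_back(A.mtxIndL[i][j]);
      C.val.push_back(A.matrixValues[i][j]);
    }
  return C;
}
```

In the port described above, the resulting arrays would populate a Morpheus CsrMatrix, which the dynamic-conversion step then turns into a DynamicMatrix; that glue is not shown here.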

GPU support required creating HostMirror containers and orchestrating host-device copies for data movement, including the MPI halo exchange. The code changes amounted to roughly 1% ($O(10^{-2})$) of the HPCG code base and were isolated in the linear algebra layer. No performance regressions were detected; in pure-CSR mode, a speedup of about 5% was observed. Importantly, supporting additional data formats or backends in the future requires changes only in Morpheus, not in HPCG.

6. Properties and Implications

Morpheus DynamicMatrix offers:

Unified abstraction for multiple formats (CSR, ELL, HYB, and extensible to others).
Support for shallow, deep, and element-wise copy/conversion.
API compatibility with host-device mirroring for heterogeneous systems.
Simple auto-tuning for runtime format selection, with low overhead.
Demonstrated substantial performance benefits in real scientific codes (e.g., HPCG).

A plausible implication is that DynamicMatrix’s approach can serve as a prototype for future performance-portable linear algebra libraries targeting rapidly evolving architectures and diverse sparsity patterns. Decoupling format selection from user code allows incremental adoption in existing HPC codes with minimal refactoring (Stylianou et al., 2022).

References

Stylianou et al. (2022). Exploiting dynamic sparse matrices for performance portable linear algebra operations.
