Morpheus DynamicMatrix Overview
- Morpheus DynamicMatrix is a unified abstraction that dynamically represents sparse matrices in multiple formats (CSR, ELL, HYB) to optimize computations.
- It provides a consistent API and supports both manual and auto-tuning runtime selection for maximizing performance on heterogeneous systems.
- Empirical benchmarks in HPCG demonstrate significant speedups, with performance improvements up to 7× on GPUs and 2.5× on CPUs.
Morpheus DynamicMatrix is a unified abstraction for dynamic sparse matrices designed to provide high productivity and performance-portability for sparse linear algebra operations on heterogeneous platforms. At its core, a Morpheus DynamicMatrix is a C++ container capable of representing a sparse matrix as one of multiple standard formats—principally CSR (Compressed Sparse Row), ELL (ELLPACK), or HYB (a hybrid combining ELL and COO)—and dynamically switching among them at runtime. This facility is central to the Morpheus library's strategy of separating the user-facing interface from platform- and problem-specific optimizations, thereby allowing end-users to benefit from runtime format selection without deep knowledge of individual sparse matrix storage formats (Stylianou et al., 2022).
1. Definition and Data Layouts
A DynamicMatrix encapsulates the following:
- CSR, ELL, and HYB (ELL+COO) format instances, with an internal enum indicating the "active" representation.
- An external interface matching the APIs for the underlying formats (e.g., SpMV, element-wise update, conversion).
The data layouts for each format are as follows:
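| Format | Storage arrays | Row organization |
|---|---|---|
| CSR | `val`, `col`, `row_ptr` | Nonzeros of row $i$ are stored at positions `row_ptr[i]` to `row_ptr[i+1] - 1` |
| ELL | `val_ell`, `col_ell` | Each row padded to a fixed width $K$ (typically the maximum number of nonzeros in any row) |
| HYB | ELL arrays plus COO arrays | ELL for up to $K$ nonzeros per row, remainder in COO |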
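To make the layouts concrete, the following is a minimal sketch that stores one small matrix in both CSR and ELL form and runs a sparse matrix-vector product over each. The struct names, the `-1` padding marker, and the row-major ELL ordering are assumptions chosen for illustration; they are not the Morpheus types or its actual memory layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative layouts only; these structs are NOT the Morpheus types.
struct Csr {
    std::size_t nrows, ncols;
    std::vector<int> row_ptr;   // size nrows + 1
    std::vector<int> col;       // size nnz
    std::vector<double> val;    // size nnz
};

struct Ell {
    std::size_t nrows, ncols, width;  // width = K, slots per row after padding
    std::vector<int> col;             // size nrows * width, -1 marks padding
    std::vector<double> val;          // size nrows * width, 0.0 in padding
};

// y = A * x for CSR: row i owns entries row_ptr[i] .. row_ptr[i+1]-1.
std::vector<double> spmv(const Csr& A, const std::vector<double>& x) {
    std::vector<double> y(A.nrows, 0.0);
    for (std::size_t i = 0; i < A.nrows; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col[k]];
    return y;
}

// y = A * x for ELL: every row has exactly `width` slots (row-major here).
std::vector<double> spmv(const Ell& A, const std::vector<double>& x) {
    std::vector<double> y(A.nrows, 0.0);
    for (std::size_t i = 0; i < A.nrows; ++i)
        for (std::size_t j = 0; j < A.width; ++j) {
            const int c = A.col[i * A.width + j];
            if (c >= 0) y[i] += A.val[i * A.width + j] * x[c];  // skip padding
        }
    return y;
}

// The 3x3 matrix [[1,0,2],[0,3,0],[4,5,6]] in both layouts.
Csr example_csr() {
    return {3, 3, {0, 2, 3, 6}, {0, 2, 1, 0, 1, 2}, {1, 2, 3, 4, 5, 6}};
}
Ell example_ell() {
    return {3, 3, 3,
            {0, 2, -1,   1, -1, -1,   0, 1, 2},
            {1, 2, 0,    3, 0, 0,     4, 5, 6}};
}
```

Both `spmv` overloads return the same vector for the same input; the difference lies only in how much padding is stored and how regular the memory accesses are, which is why the better format depends on the sparsity pattern and the hardware.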
DynamicMatrix enables seamless activation of any supported format through format conversion routines. For example:
```cpp
DynamicMatrix<double,int,Cuda,Device> A;
CsrMatrix<double,int,...> A_csr = ...;
A = A_csr;                 // A.active() == Format::CSR
A.activate(Format::ELL);   // Converts in place to ELL
```
2. Internal Architecture and API
DynamicMatrix is implemented following the State and Visitor design patterns. Internally, it holds one instance of each supported sparse matrix format and a state variable designating the active format. The API provides:
- Construction from any supported concrete format (CSR, ELL, HYB).
- Format query (`active()`) and explicit activation (`activate(Format f)`).
- Unified high-level algorithms including matrix-vector multiply (`multiply()`), and elementwise and structural conversions (`copy_from()`, `convert_from()`).
- Templated interfaces for vector and backend abstractions.
- Dispatch to platform-specific or optimized kernels determined dynamically according to the active storage format.
Morpheus leverages Kokkos abstractions for execution space and memory space, ensuring compatibility across CPUs and GPUs. All functions maintain a consistent signature irrespective of the backend or storage format, which enables source code portability.
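The State and Visitor idea can be illustrated with a small standard-library sketch. It is not the Morpheus implementation: `DynamicMatrixSketch`, `spmv_kernel`, `to_csr`, and `to_coo` are hypothetical names, only COO and CSR are modelled to keep the example short, and `std::variant` (which stores only the active alternative) is swapped in for the arrangement described above, where one instance of every format is held alongside a state variable.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins for two storage formats; not the Morpheus classes.
struct Coo { std::size_t nrows; std::vector<int> row, col; std::vector<double> val; };
struct Csr { std::size_t nrows; std::vector<int> row_ptr, col; std::vector<double> val; };

enum class Format { COO, CSR };

// Format-specific kernels: y += A * x.
void spmv_kernel(const Coo& a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t k = 0; k < a.val.size(); ++k)
        y[a.row[k]] += a.val[k] * x[a.col[k]];
}
void spmv_kernel(const Csr& a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < a.nrows; ++i)
        for (int k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            y[i] += a.val[k] * x[a.col[k]];
}

// Conversions; the COO input is assumed to be sorted by row.
Csr to_csr(const Coo& a) {
    Csr c{a.nrows, std::vector<int>(a.nrows + 1, 0), a.col, a.val};
    for (int r : a.row) ++c.row_ptr[r + 1];                 // count per row
    for (std::size_t i = 0; i < a.nrows; ++i)               // prefix sum
        c.row_ptr[i + 1] += c.row_ptr[i];
    return c;
}
Coo to_coo(const Csr& a) {
    Coo c{a.nrows, {}, a.col, a.val};
    for (std::size_t i = 0; i < a.nrows; ++i)
        for (int k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            c.row.push_back(static_cast<int>(i));
    return c;
}

class DynamicMatrixSketch {
    std::variant<Coo, Csr> state_;  // the active representation
public:
    DynamicMatrixSketch(Coo a) : state_(std::move(a)) {}
    DynamicMatrixSketch(Csr a) : state_(std::move(a)) {}

    Format active() const {
        return std::holds_alternative<Coo>(state_) ? Format::COO : Format::CSR;
    }
    // State change: convert the stored data to the requested format.
    void activate(Format f) {
        if (f == active()) return;
        if (f == Format::CSR) state_ = to_csr(std::get<Coo>(state_));
        else                  state_ = to_coo(std::get<Csr>(state_));
    }
    std::size_t nrows() const {
        return std::visit([](const auto& m) { return m.nrows; }, state_);
    }
    // Visitor: one call site, kernel chosen by the active alternative.
    std::vector<double> multiply(const std::vector<double>& x) const {
        std::vector<double> y(nrows(), 0.0);
        std::visit([&](const auto& m) { spmv_kernel(m, x, y); }, state_);
        return y;
    }
};
```

Calling code only ever uses `multiply()`, `active()`, and `activate()`, so adding a format in this sketch means adding a struct, a kernel overload, and conversions, without touching any call site.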
3. Dynamic Runtime Format Selection
Morpheus DynamicMatrix exposes two runtime format selection mechanisms:
- Manual selection: The user can invoke `A.activate(Format::ELL)` or similar, or specify the format via configuration at runtime.
- Automatic (auto-tuning): The library benchmarks the cost of candidate formats for operations such as SpMV, then selects the cheapest one for subsequent computations. In distributed contexts (e.g., MPI), selection can be performed per process and separately for the local and ghost submatrices.
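A benchmark-then-select tuner of the kind described for the automatic mode could look roughly like the sketch below. It is an assumption-laden illustration rather than the Morpheus auto-tuner: `time_seconds`, `Candidate`, and `select_format` are hypothetical names, and each candidate's cost is supplied as a callable so that a measured timing or a model can be plugged in.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

enum class Format { CSR, ELL, HYB };

// Average wall-clock seconds per call of `kernel` over `reps` timed runs.
double time_seconds(const std::function<void()>& kernel, int reps) {
    using clock = std::chrono::steady_clock;
    kernel();  // warm-up run, not timed
    const auto t0 = clock::now();
    for (int r = 0; r < reps; ++r) kernel();
    const std::chrono::duration<double> dt = clock::now() - t0;
    return dt.count() / reps;
}

// One candidate format together with a way to obtain its cost.
struct Candidate {
    Format format;
    std::function<double()> cost;  // measured or modelled cost, lower is better
};

// Returns the format with the lowest cost; ties go to the earliest candidate.
Format select_format(const std::vector<Candidate>& candidates) {
    assert(!candidates.empty());
    Format best = candidates.front().format;
    double best_cost = std::numeric_limits<double>::infinity();
    for (const auto& c : candidates) {
        const double t = c.cost();
        if (t < best_cost) {
            best_cost = t;
            best = c.format;
        }
    }
    return best;
}
```

In a real tuner each `cost` callable would wrap `time_seconds` around an SpMV in the corresponding format, and the result would then be passed to something like `activate()`. In an MPI setting, each process could run the same selection independently on its own local and ghost blocks.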
Conversions between formats use COO as an intermediate (proxy) format; the cost of a conversion (e.g., CSR→ELL) is proportional to the number of nonzeros and is empirically shown to be negligible. Section IV-A of the cited paper demonstrates that switching DynamicMatrix to CSR in HPCG incurs at most a 5% overhead and frequently yields a minor speedup (Stylianou et al., 2022).
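The COO-as-proxy route can be sketched as two single-pass conversions chained together. The structs and function names below are hypothetical, and the ELL layout (row-major, `-1` as the padding column) is an assumption for the example. Each pass touches every nonzero once; note that allocating the padded ELL arrays additionally costs time proportional to the number of rows times the padded width.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal formats; not the Morpheus containers.
struct Coo { std::size_t nrows; std::vector<int> row, col; std::vector<double> val; };
struct Csr { std::size_t nrows; std::vector<int> row_ptr, col; std::vector<double> val; };
struct Ell { std::size_t nrows, width; std::vector<int> col; std::vector<double> val; };

// CSR -> COO: expand row_ptr into explicit row indices, one pass over nnz.
Coo csr_to_coo(const Csr& a) {
    Coo c{a.nrows, {}, a.col, a.val};
    c.row.reserve(a.val.size());
    for (std::size_t i = 0; i < a.nrows; ++i)
        for (int k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            c.row.push_back(static_cast<int>(i));
    return c;
}

// COO -> ELL: count entries per row, then pad every row to the maximum count.
Ell coo_to_ell(const Coo& a) {
    std::vector<std::size_t> count(a.nrows, 0);
    for (int r : a.row) ++count[r];
    const std::size_t width =
        a.nrows ? *std::max_element(count.begin(), count.end()) : 0;

    Ell e{a.nrows, width,
          std::vector<int>(a.nrows * width, -1),      // -1 marks padding
          std::vector<double>(a.nrows * width, 0.0)};
    std::vector<std::size_t> next(a.nrows, 0);        // next free slot per row
    for (std::size_t k = 0; k < a.val.size(); ++k) {
        const std::size_t r = a.row[k];
        const std::size_t slot = r * width + next[r]++;
        e.col[slot] = a.col[k];
        e.val[slot] = a.val[k];
    }
    return e;
}

// CSR -> ELL, routed through COO.
Ell csr_to_ell(const Csr& a) { return coo_to_ell(csr_to_coo(a)); }
```

Routing every conversion through one intermediate format means that N formats need only 2N conversion routines instead of one per ordered pair.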
4. Performance Portability Benchmarks
Comprehensive benchmarking was conducted on the ARCHER2 (AMD EPYC 7742, 64-core nodes, OpenMP) and Cirrus (Dual Xeon Gold 6248 + 4×Tesla V100, CUDA) platforms. Key empirical results include:
- On single-node CPUs (ARCHER2), switching from CSR to DIA format delivered a speedup for large 27-point-stencil HPCG matrices.
- On Cirrus GPUs, DIA likewise outperformed CSR.
- In strong scaling regimes (512×512×256 problem, ARCHER2), DynamicMatrix achieved a speedup over CSR-only HPCG.
- With single Cirrus GPUs (384×256×128), DIA accelerated the benchmark, with the benefit shrinking at 8 GPUs as communication became dominant.
- In weak scaling, using DIA for the local submatrix and COO for the ghost submatrix improved performance on both ARCHER2 and Cirrus GPUs.
Overall, DynamicMatrix's runtime format selection yielded improvements of up to 2.5× (CPU) and 7× (GPU) in the SpMV kernel of HPCG, all achieved with no code modifications beyond the Morpheus port (Stylianou et al., 2022).
5. Porting High Performance Conjugate Gradient (HPCG) to Morpheus
The HPCG benchmark was ported to Morpheus DynamicMatrix through three main steps:
- Vector replacement: Substitution of user vectors with Morpheus DenseVector, aliasing existing memory buffers, and direct replacement of dot and WAXPBY operations.
- Matrix replacement: Replacement of HPCG's SparseMatrix (CSR-like, pointer-of-pointers) with Morpheus CsrMatrix, with elementwise conversion at setup. SpMV is replaced by `Morpheus::multiply`.
- Dynamic conversion: Conversion of CsrMatrix to DynamicMatrix at setup, activation of the runtime-selected format, and use of `Morpheus::multiply`.
GPU support required creation of HostMirror containers and orchestrated host-device copying for data movement, including MPI halo exchange. The code changes were isolated in the linear algebra layer of the HPCG code base. No performance regressions were detected; in pure-CSR mode, a 5% speedup was observed. Importantly, future support for additional data formats or backends necessitates changes only in Morpheus, not in HPCG.
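The vector-replacement step relies on wrapping memory the application already owns instead of copying it. A minimal sketch of that idea, assuming a hypothetical non-owning `VectorView` in place of the Morpheus DenseVector and plain loops in place of the library's dot and WAXPBY routines, might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view; stands in for a vector type that can alias
// a buffer already allocated by the application (no copy is made).
struct VectorView {
    double* data;
    std::size_t size;
    double& operator[](std::size_t i) const { return data[i]; }
};

// Dot product of two views of equal length.
double dot(const VectorView& x, const VectorView& y) {
    assert(x.size == y.size);
    double s = 0.0;
    for (std::size_t i = 0; i < x.size; ++i) s += x[i] * y[i];
    return s;
}

// w = alpha * x + beta * y (the WAXPBY operation used by HPCG).
void waxpby(double alpha, const VectorView& x,
            double beta, const VectorView& y, VectorView w) {
    assert(x.size == y.size && y.size == w.size);
    for (std::size_t i = 0; i < w.size; ++i)
        w[i] = alpha * x[i] + beta * y[i];
}
```

Because the view aliases the original buffer, results written through `w` are visible to the rest of the application immediately, which is what lets such a port proceed one routine at a time.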
6. Properties and Implications
Morpheus DynamicMatrix offers:
- Unified abstraction for multiple formats (CSR, ELL, HYB, and extensible to others).
- Support for shallow, deep, and element-wise copy/conversion.
- API compatibility with host-device mirroring for heterogeneous systems.
- Simple auto-tuning for runtime selection of format with minimal or zero overhead.
- Demonstrated substantial performance benefits in real scientific codes (e.g., HPCG).
A plausible implication is that DynamicMatrix’s approach can serve as a prototype for future performance-portable linear algebra libraries targeting rapidly evolving architectures and diverse sparsity patterns. The decoupling of format selection from user code allows for incremental adoption on existing HPC codes with minimal refactoring (Stylianou et al., 2022).