Sparse Operations in Scientific and ML Computing
- Sparse operations are computational primitives that process only nonzero elements using tailored data arrangements like CSR and COO, enabling efficient memory and compute usage.
- They underpin key kernels such as SpMV, SpMM, SpGEMM, and SDDMM, driving performance improvements in scientific computing, graph analytics, and deep learning.
- Researchers focus on algorithmic optimizations, hardware-software co-design, and adaptive APIs to address challenges from irregular sparsity and enhance scalability.
Sparse operations are computational primitives in which most operands are zero and both storage and compute are structured to act only on nonzero entries. These operations pervade applications such as scientific computing, numerical simulation, deep learning, graph analytics, computational biology, and information retrieval. Their defining characteristics—irregular sparsity patterns, indirect memory accesses, and data-dependent control flow—distinguish them fundamentally from their dense counterparts, requiring specialized data structures, algorithms, and hardware-software co-design to achieve high performance.
1. Mathematical Formulations and Canonical Sparse Operations
The core sparse linear algebra kernels include:
- Sparse Matrix–Vector Multiplication (SpMV): $y_i = \sum_{j \in N(i)} A_{ij} x_j$, where $N(i)$ is the set of nonzero columns in row $i$ (Yan et al., 2024).
- Sparse Matrix–Dense Matrix Multiplication (SpMM): $C_{ik} = \sum_{j \in N(i)} A_{ij} B_{jk}$ (Yan et al., 2024, Buluç, 6 Aug 2025).
- Sparse-Sparse Matrix Multiplication (SpGEMM/SpMSpM): $C_{ij} = \bigoplus_{k} A_{ik} \otimes B_{kj}$, where $\oplus$ and $\otimes$ denote elementwise sum and multiply, possibly over a custom semiring (Buluç, 6 Aug 2025).
- Sampled Dense-Dense Matrix Multiplication (SDDMM): $C_{ij} = S_{ij} \cdot (A B^\top)_{ij}$, computed only where $S_{ij} \neq 0$ (Yan et al., 2024, Li et al., 12 May 2025).
- Other Operations:
- Sparse addition, elementwise multiply
- Triangular solve, Cholesky/QR/LU factorizations
- (Sub)manifold sparse convolution (Graham et al., 2017)
- Sparse tensor contractions (Sivkov et al., 2019)
The algebraic generality is reflected in the semiring abstraction: many kernels permit overloading of $\oplus$ and $\otimes$, enabling applications to graph algorithms, distance computations, and custom aggregations (Buluç, 6 Aug 2025, Nolet et al., 2021).
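The semiring abstraction can be made concrete with a small sketch: a CSR SpMV whose accumulation operators are passed in as parameters. The `(add, mul, zero)` triple is an illustrative stand-in for the overloadable $\oplus$/$\otimes$ pair; all names here are invented for exposition, not taken from any of the cited libraries.

```python
def semiring_spmv(row_ptrs, col_idx, vals, x, add, mul, zero):
    """y = A (add.mul) x for a CSR matrix A and a semiring (add, mul, zero)."""
    n = len(row_ptrs) - 1
    y = [zero] * n
    for i in range(n):
        acc = zero
        # Iterate only over the nonzeros of row i (CSR streaming access).
        for p in range(row_ptrs[i], row_ptrs[i + 1]):
            acc = add(acc, mul(vals[p], x[col_idx[p]]))
        y[i] = acc
    return y

# CSR encoding of A = [[0, 2], [3, 0]]
row_ptrs, col_idx, vals = [0, 1, 2], [1, 0], [2.0, 3.0]

# Standard (+, *) semiring: ordinary SpMV.
y = semiring_spmv(row_ptrs, col_idx, vals, [1.0, 1.0],
                  lambda a, b: a + b, lambda a, b: a * b, 0.0)
# y == [2.0, 3.0]

# (min, +) semiring: one relaxation step of single-source shortest paths.
INF = float("inf")
d = semiring_spmv(row_ptrs, col_idx, vals, [0.0, INF],
                  min, lambda a, b: a + b, INF)
# d == [inf, 3.0]: vertex 1 is reachable from the source at cost 3.
```

Swapping the operator triple, as shown with the (min, +) case, is exactly the mechanism that turns one sparse kernel into a family of graph and distance computations.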
2. Data Structures and Storage Formats
Sparse operations depend critically on optimal data layouts that index only nonzeros:
- COO (Coordinate): Each nonzero is stored as a (row, column, value) triple; simple, but inefficient for multiple passes.
- CSR (Compressed Sparse Row): Triplet of arrays (rowPtrs, colIdx, vals); dominant for row-based operations, ensures row-major streaming (Nytko et al., 2022, Yan et al., 2024).
- CSC (Compressed Sparse Column): Analog of CSR, column-oriented.
- Block-sparse formats (BSR, DBCSR): Group nonzeros into small dense blocks, reducing index overhead and increasing arithmetic intensity; critical to many scientific and machine-learning workloads (Sivkov et al., 2019, Li et al., 2023).
- Hierarchical bitmap (SMASH): Encodes blocks of nonzeros via multi-level bitmaps, suitable for hardware-accelerated indexing by a Bitmap Management Unit (BMU) (Kanellopoulos et al., 2019).
- F-COO for tensors: Identifies product/index modes and encodes sparse tensors for memory-bandwidth efficiency on GPUs (Liu et al., 2017).
The choice of format impacts indexing cost, memory bandwidth, and kernel fusions; many frameworks dynamically select or infer formats (e.g., Scorch compiler stack (Yan et al., 2024)).
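The relationship between the two most common formats is easy to see in code. Below is a minimal sketch of the standard two-pass COO-to-CSR conversion (histogram, prefix sum, scatter); it illustrates the construction pattern generically and is not drawn from any specific library.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triples to CSR (row_ptrs, col_idx, vals) arrays."""
    row_ptrs = [0] * (n_rows + 1)
    for r in rows:                      # pass 1: histogram of per-row counts
        row_ptrs[r + 1] += 1
    for i in range(n_rows):             # prefix sum -> row start offsets
        row_ptrs[i + 1] += row_ptrs[i]
    col_idx = [0] * len(vals)
    out_vals = [0.0] * len(vals)
    next_slot = row_ptrs[:-1].copy()    # per-row write cursors
    for r, c, v in zip(rows, cols, vals):   # pass 2: scatter into place
        p = next_slot[r]
        col_idx[p] = c
        out_vals[p] = v
        next_slot[r] += 1
    return row_ptrs, col_idx, out_vals

# 2x3 matrix with nonzeros (0,2)=5, (1,0)=7, (0,1)=1
rp, ci, vv = coo_to_csr(2, [0, 1, 0], [2, 0, 1], [5.0, 7.0, 1.0])
# rp == [0, 2, 3]: row 0 owns slots 0-1, row 1 owns slot 2.
```

The prefix-summed `row_ptrs` array is what gives CSR its row-major streaming property: every row's nonzeros are contiguous, at the cost of this one-time conversion.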
3. Algorithms, Scheduling, and Compiler Approaches
Sparse operations involve two main algorithmic patterns:
- Direct iteration: Row-major or column-major scan, accumulating only over nonzeros (Nytko et al., 2022).
- Two-phase (symbolic + numeric): In SpGEMM, first determine output pattern (symbolic), then populate values (numeric); required because output nnz is data-dependent (Abdelfattah et al., 2024, Buluç, 6 Aug 2025).
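The two-phase pattern can be sketched in a few lines. This illustrative version uses a Python dict as the per-row accumulator; production SpGEMM kernels use hash- or heap-based accumulators, but the symbolic/numeric split is the same.

```python
def spgemm_symbolic(a_ptrs, a_cols, b_ptrs, b_cols, n_rows):
    """Phase 1: count nnz per output row so C can be allocated up front."""
    row_nnz = []
    for i in range(n_rows):
        seen = set()
        for pa in range(a_ptrs[i], a_ptrs[i + 1]):
            j = a_cols[pa]
            for pb in range(b_ptrs[j], b_ptrs[j + 1]):
                seen.add(b_cols[pb])
        row_nnz.append(len(seen))
    return row_nnz

def spgemm_numeric(a_ptrs, a_cols, a_vals, b_ptrs, b_cols, b_vals, n_rows):
    """Phase 2: populate values now that the output pattern is known."""
    c_ptrs, c_cols, c_vals = [0], [], []
    for i in range(n_rows):
        acc = {}
        for pa in range(a_ptrs[i], a_ptrs[i + 1]):
            j, av = a_cols[pa], a_vals[pa]
            for pb in range(b_ptrs[j], b_ptrs[j + 1]):
                k = b_cols[pb]
                acc[k] = acc.get(k, 0.0) + av * b_vals[pb]
        for k in sorted(acc):           # emit row i in column order
            c_cols.append(k)
            c_vals.append(acc[k])
        c_ptrs.append(len(c_cols))
    return c_ptrs, c_cols, c_vals

# Identity A times B = [[0, 2], [3, 0]] reproduces B:
a = ([0, 1, 2], [0, 1], [1.0, 1.0])
b = ([0, 1, 2], [1, 0], [2.0, 3.0])
nnz = spgemm_symbolic(a[0], a[1], b[0], b[1], 2)            # [1, 1]
c_ptrs, c_cols, c_vals = spgemm_numeric(a[0], a[1], a[2],
                                        b[0], b[1], b[2], 2)
```

The symbolic pass exists precisely because `len(acc)` per row cannot be known without walking the input patterns, which is the data-dependence noted above.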
Optimizations include:
- Blocking and tiling: Used both to exploit hardware caches on CPUs and GPUs (fitting working sets into L1/L2) and to map computations onto tensor cores or accelerators (Li et al., 12 May 2025, Li et al., 2022, Sivkov et al., 2019).
- Auto-tuning: Libraries like iSpLib search over block sizes, thread configurations, and kernel fusions to maximize SIMD and memory efficiency (Anik et al., 2024).
- Loop reordering and workspace management: Modern compilers (e.g., Scorch) analyze expressions to choose index order, insert temporaries, and select optimal tilings (Yan et al., 2024).
- Compile-time specialization: SpComp statically analyzes sparsity patterns, extracts essential indices, and emits piecewise-regular, indirect-free code for maximal cache and locality performance (Basak et al., 2023).
- Sampling/probabilistic approximation: RSC replaces SpMM/SpGEMM in GNNs with unbiased random sampling, controlling the compute/resource trade-off per epoch and per layer, and achieving substantial per-layer and end-to-end speedups with negligible accuracy loss (Liu et al., 2022).
- Hardware-software co-design: E.g., SMASH with BMU, PopSparse on IPU, Azul on FPGA, all exploit block layouts, meta-instructions, or spatial parallelism to address the complexity of sparse operation execution (Kanellopoulos et al., 2019, Li et al., 2023, Parthasarathy, 15 Sep 2025).
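To make the blocking idea above concrete, here is a hedged sketch of BSR-style SpMV: nonzeros are grouped into b×b dense blocks, so a single index lookup amortizes over b*b multiply-adds and the inner loop becomes regular and vectorizable. The layout and names are illustrative, not any particular library's format.

```python
def bsr_spmv(block_ptrs, block_cols, blocks, x, b):
    """y = A x where A is stored as dense b x b blocks in block-CSR order."""
    n_block_rows = len(block_ptrs) - 1
    y = [0.0] * (n_block_rows * b)
    for ib in range(n_block_rows):
        for p in range(block_ptrs[ib], block_ptrs[ib + 1]):
            jb = block_cols[p]          # one index per block, not per scalar
            blk = blocks[p]             # dense b x b block, row-major list
            for r in range(b):
                acc = 0.0
                for c in range(b):      # contiguous, vectorization-friendly
                    acc += blk[r * b + c] * x[jb * b + c]
                y[ib * b + r] += acc
    return y

# One 2x2 block [[1, 2], [3, 4]] at block position (0, 0):
y = bsr_spmv([0, 1], [0], [[1.0, 2.0, 3.0, 4.0]], [1.0, 1.0], 2)
# y == [3.0, 7.0]
```

Compared with scalar CSR, the index arrays shrink by a factor of roughly b*b, which is the arithmetic-intensity gain that block formats like DBCSR and PopSparse exploit.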
4. Sparse Operations in Machine Learning and Deep Learning
Sparse kernels underpin critical operations in modern ML:
- Graph neural network (GNN) propagation: The core step is SpMM between the (sparse) normalized adjacency matrix and dense node embeddings (Liu et al., 2022, Yan et al., 2024, Anik et al., 2024). Libraries like iSpLib and Scorch provide auto-tuned or compiler-generated sparse ops that accelerate GNN workloads severalfold over PyTorch baselines (Anik et al., 2024, Yan et al., 2024).
- Sparse attention / transformers: SDDMM and SpMM form the backbone of attention over sparse graphs or masks. Fused3S fuses SDDMM, softmax, and SpMM into a single high-utilization GPU kernel, delivering substantial kernel-level and end-to-end inference speedups in graph transformer models (Li et al., 12 May 2025).
- Sparse neural architecture search (NAS): DASS embeds parametric sparse linear and convolution operations into a DARTS-like differentiable architecture search, allowing direct control of the nonzero count and achieving significant latency gains with no loss in accuracy at high pruning ratios (Mousavi et al., 2022).
- Knowledge graph embedding: Translation-based KGE models are reformulated as two SpMMs per minibatch, replacing fine-grained gather/scatter operations and yielding substantial speedups with a lower memory footprint (Anik et al., 24 Feb 2025).
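The SDDMM-softmax-SpMM pipeline behind sparse attention can be sketched end to end in plain Python. This is an illustrative stand-in for the fused GPU kernels discussed above (which operate on tiles with tensor cores); the mask is a CSR pattern and each row attends only to its stored neighbors.

```python
import math

def fused_sparse_attention(row_ptrs, col_idx, Q, K, V):
    """Per row i: masked scores (SDDMM), row softmax, weighted sum of V (SpMM)."""
    n, d = len(row_ptrs) - 1, len(Q[0])
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        lo, hi = row_ptrs[i], row_ptrs[i + 1]
        # SDDMM: dot(Q_i, K_j) computed only where the mask has a nonzero.
        scores = [sum(Q[i][t] * K[col_idx[p]][t] for t in range(d))
                  for p in range(lo, hi)]
        # Numerically stable softmax over the row's surviving entries.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # SpMM: aggregate V weighted by the normalized scores.
        for w, p in zip(exps, range(lo, hi)):
            j = col_idx[p]
            for t in range(d):
                out[i][t] += (w / z) * V[j][t]
    return out

# Diagonal mask: each of the 2 nodes attends only to itself,
# so the output equals V exactly.
out = fused_sparse_attention([0, 1, 2], [0, 1],
                             [[1.0], [2.0]], [[1.0], [2.0]], [[1.0], [2.0]])
# out == [[1.0], [2.0]]
```

Fusing the three stages avoids materializing the intermediate score matrix, which is the main memory-traffic saving that fused kernels such as Fused3S target.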
5. Performance, Scalability, and Hardware Acceleration
Performance of sparse operations is dictated by:
- Arithmetic intensity: Operations per byte; block formats and kernel fusions (e.g., DBCSR (Sivkov et al., 2019), PopSparse (Li et al., 2023), Magicube (Li et al., 2022)) maximize this ratio.
- Parallelism: GPUs assign rows/blocks to warps; FPGAs (Azul (Parthasarathy, 15 Sep 2025))/IPUs (PopSparse (Li et al., 2023)) distribute tiles across spatial grids, mapping blocks to local memory for high utilization.
- Hardware primitives: Tensor cores (Magicube (Li et al., 2022), Fused3S (Li et al., 12 May 2025)) require specific data layouts—SR-BCRS, block sizes, data alignment to maximize MMA shape coverage.
- Empirical scaling and tuning: Sparse operations vastly outperform dense ones at low densities, and performance scales linearly in the number of nonzeros for SpMV/SpMM. At higher densities, break-even points depend on feature dimension and hardware (Nytko et al., 2022, Yan et al., 2024).
- Compiler/runtime adaptivity: Frameworks such as Scorch dynamically dispatch optimal kernels given input format and inferred sparsity threshold (Yan et al., 2024).
- End-to-end acceleration: On typical GNN and transformer inference workloads, modern libraries routinely achieve multi-fold kernel-level and end-to-end speedups (Li et al., 12 May 2025, Yan et al., 2024, Anik et al., 2024).
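The scaling argument above admits a simple back-of-envelope model: SpMM costs about 2*nnz*k flops versus 2*m*n*k for the dense equivalent, so the flop-count advantage is exactly the inverse of the density. The figures below are illustrative, and real kernels realize only part of this advantage because of irregular memory access.

```python
def spmm_flops(nnz, k):
    # One fused multiply-add (2 flops) per nonzero per output column.
    return 2 * nnz * k

def dense_mm_flops(m, n, k):
    return 2 * m * n * k

m = n = 10_000
k = 128                          # feature dimension
density = 0.001                  # 0.1% nonzeros
nnz = int(m * n * density)

ratio = dense_mm_flops(m, n, k) / spmm_flops(nnz, k)
# ratio == 1 / density == 1000.0: a 1000x flop advantage at 0.1% density.
```

This is why the practical break-even density depends on hardware efficiency: dense kernels run much closer to peak, so the realized crossover sits well below a flop-count tie.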
6. Software Interfaces and Standards
Recent work converges on standardized APIs exposing polymorphic sparse operations:
- C++ Sparse BLAS (2024 proposal): A multi-stage API design in which kernels with known output patterns (SpMV, SpMM, scaling, SDDMM) use a single-call or inspect/compute/fill interface, while unknown-pattern kernels (SpGEMM, elementwise multiply, add, filtering, format conversion, transpose) use explicit workspace queries followed by user allocation (Abdelfattah et al., 2024).
- Backend/hardware policy objects: Execution policy and state objects enable consistent support across platforms (CPU, GPU, accelerator), allow for pre-inspection tuning, and expose advanced configuration without sacrificing portability (Abdelfattah et al., 2024).
- Interoperability: APIs map to vendor kernels (MKL, cuSPARSE, rocSPARSE), support non-owning views for transparency and optional handles for opaque tuning, and are compatible with new language features such as std::mdspan and sender/receiver models in C++23.
7. Extensions, Generalizations, and Open Challenges
Sparse operations generalize to:
- Semiring computation: Generalized SpGEMM, supporting custom scalar operations, underpins a range of applications (graph algorithms, sketches, distances, database joins) (Buluç, 6 Aug 2025, Nolet et al., 2021).
- Tensor algebra: Higher-order contractions, as in DBCSR and F-COO, reduce to batched sparse matrix ops (Sivkov et al., 2019, Liu et al., 2017).
- Selected kernels/fusions: Masked SpGEMM, SDDMM, and their variants are integrated for applications in graph attention and masked computation (Li et al., 12 May 2025).
- Compile-time structure specialization: Approaches like SpComp eliminate all runtime indirection when the sparsity pattern is fixed, yielding significant kernel-level and sparse-Cholesky speedups over leading libraries (Basak et al., 2023).
Key ongoing challenges include treatment of highly adaptive/dynamic sparsity, mitigation of code generation explosion with complex fill patterns, extending performance portability, and efficient support for quantized and mixed-precision sparse operations.
Sparse operations thus constitute a foundational technology for scalable computation across scientific computing, machine learning, and data analytics, requiring a multidisciplinary convergence of numerical algorithms, data layout strategies, hardware-aware optimization, and rigorous software interface design. The current research trajectory emphasizes structural, algorithmic, and hardware co-evolution, with emerging standards expected to underpin next-generation portable and high-performance sparse computing frameworks (Abdelfattah et al., 2024, Yan et al., 2024, Buluç, 6 Aug 2025, Nytko et al., 2022).