Sparse-Dense Matrix Multiplication

Updated 5 May 2026

SpMM is a computational operation that multiplies a sparse matrix A with a dense matrix B while processing only nonzero elements to enhance efficiency.
It employs specialized storage formats like CSR and GCOO and leverages hardware-specific optimizations such as GPU tensor cores and SIMD tuning.
Efficient SpMM requires addressing challenges like load imbalance, memory bandwidth constraints, and dynamic scheduling for high-throughput applications.

Sparse-Dense Matrix Multiplication (SpMM) refers to the operation $C = A \cdot B$, where $A \in \mathbb{R}^{m \times k}$ is a sparse matrix (typically with $\text{nnz}(A) \ll mk$) and $B \in \mathbb{R}^{k \times n}$ is a dense matrix, producing a dense or semi-dense output $C \in \mathbb{R}^{m \times n}$. SpMM is a central primitive in scientific computing, graph analytics, deep learning, and numerous algorithmic domains, with extensive architecturally specialized and software-optimized implementations across CPUs, GPUs, FPGAs, and custom accelerators (Buluç, 6 Aug 2025, Sun et al., 2023, Huang et al., 2020, Xiang et al., 8 Apr 2025, Shi et al., 2024, Song et al., 2021, Fu et al., 2023).

1. Mathematical Formulation and Computational Complexity

Let $A$ be sparse and $B$ dense. The canonical row-wise formula is $C_{i,j} = \sum_{\ell=1}^{k} A_{i,\ell} \cdot B_{\ell,j}$. Usually, only the nonzero entries of $A$ are processed, so $C_{i,j} = \sum_{\ell \in \text{nz}(A_i)} A_{i,\ell} \cdot B_{\ell,j}$, where $\text{nz}(A_i)$ is the set of columns in row $i$ with nonzeros.

The total floating-point operation count is $2 \cdot \text{nnz}(A) \cdot n$ (each nonzero incurs $n$ multiplies and $n$ adds), with memory traffic governed by reading $\text{nnz}(A)$ values and indices, $kn$ dense elements from $B$ (possibly streamed/tiled), and writing $mn$ outputs (Buluç, 6 Aug 2025, Shi et al., 2024, Sun et al., 2023, Huang et al., 2020).

Generalization to user-defined semirings is routine in many algebraic graph kernels, i.e.,

$C_{i,j} = \bigoplus_{\ell \in \text{nz}(A_i)} A_{i,\ell} \otimes B_{\ell,j}$

where the pair $(\oplus, \otimes)$ satisfies the monoid and distributive properties (Buluç, 6 Aug 2025).
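The semiring generalization can be illustrated by passing the "add" and "multiply" operations as parameters. The sketch below is a minimal pure-Python illustration, not code from any cited work. The function name `spmm_semiring` and the row-as-dictionary layout of the sparse matrix are assumptions chosen for brevity.

```python
import operator


def spmm_semiring(A_rows, B, n, add, mul, zero):
    """SpMM over a user-supplied semiring.

    A_rows: list of {column: value} dicts, one per row of the sparse matrix
            (hypothetical layout used only for this sketch).
    B:      dense matrix as a list of rows, each of length n.
    add, mul, zero: semiring "plus", "times", and additive identity.
    """
    C = []
    for row in A_rows:
        acc = [zero] * n
        # Only the stored (nonzero) entries of this row are visited.
        for l, a in row.items():
            for j in range(n):
                acc[j] = add(acc[j], mul(a, B[l][j]))
        C.append(acc)
    return C


# Small example: 3 x 3 sparse matrix with an empty last row, 3 x 1 dense matrix.
A_rows = [{1: 2.0, 2: 5.0}, {2: 1.0}, {}]
B = [[0.0], [4.0], [1.0]]

# Ordinary (+, *) arithmetic.
C_arith = spmm_semiring(A_rows, B, 1, operator.add, operator.mul, 0.0)
# (min, +) semiring, as used in shortest-path style relaxations.
C_minplus = spmm_semiring(A_rows, B, 1, min, operator.add, float("inf"))
```

With the (min, +) choice, an empty row yields the additive identity (infinity) rather than zero, which is why the identity element has to be supplied alongside the two operations.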

2. Storage Formats and Dataflow Implications

Most SpMM kernels assume $A$ is stored in Compressed Sparse Row (CSR) or blocked extensions (e.g., BCSR, GCOO, ME-BCRS), while $B$ and $C$ are dense row-major.

Common CSR representation:

rowPtr[0..m]: offsets into colInd and val; the nonzeros of row i occupy positions rowPtr[i] through rowPtr[i+1]-1,
colInd[0..nnz-1]: column indices for nonzeros,
val[0..nnz-1]: values.
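As a reference point for the three CSR arrays above, here is a minimal, unoptimized row-wise SpMM in pure Python. It is a sketch for checking results, not a tuned kernel; the function name `spmm_csr` and the snake_case array names are choices made for this example.

```python
def spmm_csr(row_ptr, col_ind, val, B, n):
    """Compute C = A * B with A in CSR and B a dense list of rows of length n."""
    m = len(row_ptr) - 1
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        c_row = C[i]
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = val[p]
            b_row = B[col_ind[p]]
            # Each nonzero scales one full row of B: n multiplies and n adds.
            for j in range(n):
                c_row[j] += a * b_row[j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
row_ptr = [0, 2, 2, 3]
col_ind = [0, 2, 1]
val = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csr(row_ptr, col_ind, val, B, n=2)
```

The empty middle row shows up as two equal consecutive entries in `row_ptr`, so the inner loop simply does not execute for it.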

Advanced formats optimize memory hierarchy and core alignment:

Grouped COO (GCOO): partitions columns into coarse groups for shared-memory reuse (Shi et al., 2020).
Blocked formats (ME-BCRS, BCSR): exploit 2D tile regularity to maximize contiguous copy and hardware acceleration (Shi et al., 2024, Li et al., 2024, Ma et al., 3 Mar 2025).

Choice of sparse format strongly affects coalesced memory access, vectorization, and the mapping to accelerator MMA units (e.g., for GPUs and SME on Arm) (Shi et al., 2024, Xiang et al., 8 Apr 2025, Shi et al., 2020, Lei et al., 11 Nov 2025).
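To make the trade-off of blocked formats concrete, the following sketch converts CSR into a simple BCSR layout with zero-padded dense tiles and reports how much of the stored tile area is actually nonzero. It is a generic illustration rather than the ME-BCRS or GCOO layout of the cited papers; `csr_to_bcsr` and its return layout are assumptions made for this example.

```python
def csr_to_bcsr(row_ptr, col_ind, val, bh, bw):
    """Convert CSR to BCSR with bh x bw dense tiles (zero-padded)."""
    m = len(row_ptr) - 1
    block_rows = (m + bh - 1) // bh
    b_row_ptr, b_col_ind, blocks = [0], [], []
    for br in range(block_rows):
        tiles = {}  # block-column index -> dense bh x bw tile
        for i in range(br * bh, min((br + 1) * bh, m)):
            for p in range(row_ptr[i], row_ptr[i + 1]):
                bc, c = divmod(col_ind[p], bw)
                tile = tiles.setdefault(bc, [[0.0] * bw for _ in range(bh)])
                tile[i - br * bh][c] = val[p]
        # Store the nonempty tiles of this block row in column order.
        for bc in sorted(tiles):
            b_col_ind.append(bc)
            blocks.append(tiles[bc])
        b_row_ptr.append(len(b_col_ind))
    return b_row_ptr, b_col_ind, blocks


# A = [[1, 2, 0, 0],
#      [3, 0, 0, 0],
#      [0, 0, 0, 4],
#      [0, 0, 0, 0]]
row_ptr = [0, 2, 3, 4, 4]
col_ind = [0, 1, 0, 3]
val = [1.0, 2.0, 3.0, 4.0]
b_row_ptr, b_col_ind, blocks = csr_to_bcsr(row_ptr, col_ind, val, bh=2, bw=2)

# Fraction of stored tile entries that are true nonzeros.
fill = row_ptr[-1] / (len(blocks) * 2 * 2)
```

Here two tiles hold four nonzeros in eight slots, so half of the stored values are padding. Larger tiles map better onto wide hardware units but typically lower this fill fraction on irregular matrices.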

3. Algorithmic Strategies on Modern Architectures

3.1 GPU/TPU-Tailored SpMM

CUDA Scalar Kernel: Assigns one thread or warp per row, exploiting output (dense) parallelism (Huang et al., 2020, Li et al., 2024). To mitigate load imbalance and uncoalesced access, warp merging, coalesced row caching, or batched schemes are deployed (Nagasaka et al., 2019).
Tensor Core Unit (TCU) Exploitation:
- Dense zero-filling: Tiles of sparse $A$ are zero-padded to the fixed tile shape the TCU expects, then processed via MMA (Xiang et al., 8 Apr 2025).
- Hybrid approaches (e.g., cuTeSpMM): Use a "TCU-synergy" metric (tile fill rate) to decide when to invoke TCU vs. CUDA core kernels (Xiang et al., 8 Apr 2025, Li et al., 24 Feb 2026).
- Fine-grained methods (e.g., FlashSparse) minimize redundancy by swapping operands (computing $C^{\top} = B^{\top} A^{\top}$), aligning sparse granularity to the narrow TCU dimension, and removing superfluous zero padding (Shi et al., 2024).
- Specialized schemes (RSH-SpMM): Employ adaptive row partitioning and row-structured tiling to balance dense-tile formation against residual irregular rows routed to CUDA cores (Li et al., 24 Feb 2026).
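The hybrid idea in the list above (dense-enough tiles go to the tensor-core path, the rest to a scalar path) can be sketched as a simple host-side classification. The density measure below is a simplified stand-in, not the exact "TCU-synergy" metric of cuTeSpMM or the partitioning of RSH-SpMM; `classify_row_windows`, `window`, and `threshold` are hypothetical names and parameters.

```python
def classify_row_windows(row_ptr, col_ind, window, threshold):
    """Label each window of consecutive rows as 'dense_tile' or 'scalar'.

    Density is nnz divided by the area of the window after dropping
    all-zero columns (window height x number of distinct nonzero columns).
    """
    m = len(row_ptr) - 1
    plan = []
    for start in range(0, m, window):
        end = min(start + window, m)
        lo, hi = row_ptr[start], row_ptr[end]
        cols = set(col_ind[lo:hi])
        # A short final window is still charged the full padded height.
        density = (hi - lo) / (window * len(cols)) if cols else 0.0
        path = "dense_tile" if density >= threshold else "scalar"
        plan.append((start, end, density, path))
    return plan


# Rows 0-1 share columns {0, 1}; rows 2-3 have one scattered nonzero each.
row_ptr = [0, 2, 4, 5, 6]
col_ind = [0, 1, 0, 1, 0, 5]
plan = classify_row_windows(row_ptr, col_ind, window=2, threshold=0.75)
```

In this toy input the first window is fully dense after column compaction and would suit a padded MMA tile, while the second would waste half of the tile on zeros and is left to the scalar path. The threshold value is illustrative only; real systems calibrate it per device.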

3.2 CPU/SME/Other Accelerators

SIMD-Aware JIT and Autotuning: Systems like JITSPMM generate architecture and matrix-shape-aware inner loops at runtime, aggressively unrolling vectorized accumulations, optimizing register allocation, and selecting between row/nnz/merge partitioning for multithreading (Fu et al., 2023).
Hybrid SME/NEON (Armv9): LOOPS partitions A into CSR rows for NEON AXPY and narrow-vector BCSR tiles for SME's outer product units (fmopa), scheduling via a lightweight two-level performance model (Lei et al., 11 Nov 2025).
General-purpose Hardware: FusedMM fuses SpMM with related kernels (e.g., SDDMM), exposing user-defined accumulation, load-balancing, and cache-blocked register allocation for best SIMD efficiency (Rahman et al., 2020).
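The difference between row-based and nnz-based partitioning for multithreading can be shown with a short sketch that cuts the CSR row pointer at roughly equal nonzero counts. This is an illustrative heuristic, not the scheme implemented by JITSPMM or FusedMM; `partition_rows_by_nnz` is a hypothetical helper.

```python
from bisect import bisect_left


def partition_rows_by_nnz(row_ptr, parts):
    """Split rows into `parts` contiguous ranges with similar nonzero counts."""
    m = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    for t in range(1, parts):
        target = nnz * t / parts
        # First row boundary whose prefix nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries monotone and within the matrix.
        bounds.append(max(bounds[-1], min(cut, m)))
    bounds.append(m)
    return [(bounds[t], bounds[t + 1]) for t in range(parts)]


# Row nonzero counts: 6, 1, 1, 1, 1, 2 (one heavy row followed by light ones).
row_ptr = [0, 6, 7, 8, 9, 10, 12]
ranges = partition_rows_by_nnz(row_ptr, parts=2)
work = [row_ptr[end] - row_ptr[start] for start, end in ranges]
```

An equal-rows split of the same matrix would give the two workers 8 and 4 nonzeros, whereas the nnz-based cut gives 6 and 6. A single row heavier than the per-worker target still cannot be split this way, which is where merge-style partitioning comes in.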

3.3 Custom/Flexible Accelerators

Streaming Dataflow (Sextans): Implements II=1 (initiation interval) pipelines for sparse-tuple stream, N-dense memory windows, and on-chip accumulation with pointer-based task decomposition (Song et al., 2021).
IOPS and Unified Inner/Outer Product: Fuses inner-product (maximal output locality) and outer-product (maximal zero skipping) in a PE mesh, with on-chip bookkeeping of partial sums and adaptive tiling based on buffer budgets and input sparsity (Sun et al., 2023).
Distributed RDMA/SHMEM Contexts: SpMM is tiled over process grids, often using asynchrony and dynamic replication to minimize communication, with communication-eliding fusion for SDDMM→SpMM chains (Bharadwaj et al., 2022, Brock et al., 2023).
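The inner-product and outer-product dataflows mentioned above differ only in loop order, which the following software sketch makes explicit. It illustrates the two orderings and is not a model of the IOPS hardware; the row-wise and column-wise dictionary layouts (`A_rows`, `A_cols`) are assumptions made for this example.

```python
def spmm_inner(A_rows, B, n):
    """Inner-product order: each C[i][j] is completed before moving on."""
    return [
        [sum((a * B[l][j] for l, a in row.items()), 0.0) for j in range(n)]
        for row in A_rows
    ]


def spmm_outer(A_cols, B, m, n):
    """Outer-product order: column l of A times row l of B, accumulated into C."""
    C = [[0.0] * n for _ in range(m)]
    for l, col in enumerate(A_cols):
        if not col:
            continue  # an empty column of A means row l of B is never touched
        b_row = B[l]  # read once, reused for every nonzero in this column
        for i, a in col.items():
            for j in range(n):
                C[i][j] += a * b_row[j]  # partial sums are scattered across C
    return C


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
A_rows = [{0: 1.0, 2: 2.0}, {}, {1: 3.0}]   # row -> {column: value}
A_cols = [{0: 1.0}, {2: 3.0}, {0: 2.0}]     # column -> {row: value}
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

C_inner = spmm_inner(A_rows, B, n=2)
C_outer = spmm_outer(A_cols, B, m=3, n=2)
```

Both orderings produce the same result. The inner form keeps each output element local until it is finished, while the outer form reads each row of B once at the cost of keeping partial sums for many outputs live at the same time.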

3.4 Batched and Semi-external Techniques

Batched SpMM: Aggregates many small SpMMs (e.g., multiple GCN minibatches), assigning warps/subwarps for high GPU occupancy and shared-memory staging, critical in small-graph scenarios (Nagasaka et al., 2019).
Semi-external SpMM: For out-of-core large-scale problems, the sparse matrix resides on SSDs, while B is tiling-streamed through memory, orchestrated via asynchronous threads and task queues, achieving near in-memory performance (Zheng et al., 2016).
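The idea of streaming the dense operand through memory in tiles can be sketched as column-panel processing, where only one slice of B is resident at a time. This is an in-memory illustration of the tiling pattern only: it does not model SSD I/O, asynchronous threads, or task queues, and `spmm_column_panels` and `panel` are hypothetical names.

```python
def spmm_column_panels(row_ptr, col_ind, val, B, n, panel):
    """CSR SpMM that processes B in column panels of width `panel`."""
    m = len(row_ptr) - 1
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, panel):
        j1 = min(j0 + panel, n)
        # Only this slice of B (all rows, columns j0..j1-1) is needed now.
        B_panel = [row[j0:j1] for row in B]
        # The sparse matrix is re-read once per panel.
        for i in range(m):
            for p in range(row_ptr[i], row_ptr[i + 1]):
                a = val[p]
                b_row = B_panel[col_ind[p]]
                for jj in range(j1 - j0):
                    C[i][j0 + jj] += a * b_row[jj]
    return C


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
row_ptr = [0, 2, 2, 3]
col_ind = [0, 2, 1]
val = [1.0, 2.0, 3.0]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
C = spmm_column_panels(row_ptr, col_ind, val, B, n=3, panel=2)
```

The panel width trades memory footprint against the number of passes over the sparse matrix: a narrower panel needs less resident memory but re-reads A more often.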

4. Performance Models and Bottleneck Analysis

A central analytic rubric is the "roofline model", which bounds throughput by the minimum of peak compute and memory bandwidth multiplied by operational intensity (FLOPs / bytes moved). For naïve CSR SpMM: $\text{OI} \approx \dfrac{2\,\text{nnz}(A)\,n}{\text{bytes}(A) + \text{bytes}(B) + \text{bytes}(C)}$. Operational intensity increases with blocking, shared-memory reuse, or aggregating multiply-accumulates across repeated column or tile indices (Shi et al., 2020, Ma et al., 3 Mar 2025, Sun et al., 2023).
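A rough roofline estimate for CSR SpMM can be computed directly from the matrix dimensions. The sketch below assumes 4-byte values and indices and optimistically assumes every element of B is read from memory exactly once; the hardware numbers in the example are hypothetical, and `spmm_roofline` is a name introduced here.

```python
def spmm_roofline(m, k, n, nnz, peak_gflops, bandwidth_gbs,
                  value_bytes=4, index_bytes=4):
    """Return (operational intensity in FLOP/byte, attainable GFLOP/s)."""
    flops = 2 * nnz * n  # one multiply and one add per nonzero per output column
    # CSR traffic: values, column indices, and the row pointer array.
    bytes_a = nnz * (value_bytes + index_bytes) + (m + 1) * index_bytes
    # Assumption: each B element is loaded once (perfect reuse).
    bytes_b = k * n * value_bytes
    bytes_c = m * n * value_bytes
    intensity = flops / (bytes_a + bytes_b + bytes_c)
    attainable = min(peak_gflops, bandwidth_gbs * intensity)
    return intensity, attainable


# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s memory bandwidth.
intensity, attainable = spmm_roofline(
    m=1000, k=1000, n=32, nnz=10_000,
    peak_gflops=10_000.0, bandwidth_gbs=1_000.0,
)
memory_bound = attainable < 10_000.0
```

For this example the intensity is below 2 FLOP/byte, so the bound comes from bandwidth rather than compute. Because real kernels reload parts of B, measured intensity is usually lower than this estimate.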

Metrics such as "TCU-synergy" (average tile density) (Xiang et al., 8 Apr 2025), block physical fill (in block formats), and arithmetic-to-I/O ratios determine the practical kernel and hardware mapping.

At moderate-to-high sparsity, SpMM is typically memory-bandwidth-bound; as sparsity increases further, the compute-to-load ratio drops and the kernel stays memory-bound even with a highly tuned implementation (Shi et al., 2024, Ma et al., 3 Mar 2025). On accelerator arrays, communication costs can dominate, requiring topology-aware distribution and fusion (Bharadwaj et al., 2022).

5. Architecture-Specific Optimization Strategies

Hardware	Key Strategies	Representative Papers
CPUs	SIMD blocking, JIT, cache tiling, balanced partition, semiring fusion	(Fu et al., 2023, Rahman et al., 2020, Buluç, 6 Aug 2025)
NVIDIA GPUs	Memory-coalesced loads, warp-merge, TCU tile-packing, hybrid cores	(Li et al., 2024, Shi et al., 2024, Xiang et al., 8 Apr 2025, Huang et al., 2020, Li et al., 24 Feb 2026)
Arm SME	Partitioned CSR/BCSR, hybrid NEON/SME scheduler, model-guided split	(Lei et al., 11 Nov 2025)
FPGAs	Streaming II=1, pointer-based tile queuing, HBM pipelining	(Song et al., 2021)
CS-3	SELLPACK multi-channel streaming, chunked I/O/PE sync	(Shah et al., 30 Apr 2026)
Distributed	1.5D/2.5D dense/sparse shifting, communication-eliding, FusedMM	(Bharadwaj et al., 2022, Brock et al., 2023)

The table above summarizes the major architecture-specific strategies for SpMM, drawn from the cited works.

6. Applications and Case Studies

SpMM underpins:

Graph neural network (GNN) operations (e.g., message-passing, pooling, feature propagation) (Huang et al., 2020, Li et al., 2024, Rahman et al., 2020, Song et al., 2021, Shi et al., 2024)
Scientific simulations (FEM/CFD), eigensolvers (LOBPCG, Arnoldi), low-rank matrix factorization, clustering, and recommendation algorithms (Buluç, 6 Aug 2025, Zheng et al., 2016)
Block-sparse attention (transformers), CTQFT tensor networks, collaborative filtering, PCA/NMF, and sampling-based randomized linear algebra (Buluç, 6 Aug 2025, Shi et al., 2024)
Billion-node graph analytics, PageRank, block-Krylov eigensolving, NMF at semi-external scale (Zheng et al., 2016)

Typical matrix shapes and sparsity regimes vary from extremely tall/skinny or power-law-structured graphs (bioinformatics, social networks) to blocky/supernode matrices in quantum simulations.

7. Directions, Open Challenges, and Best Practices

While the last five years have seen major advances, several avenues remain at the forefront:

Irregularity Mitigation: Designs like RSH-SpMM and HC-SpMM highlight data-dependent adaptive partitioning, cross-kernel fusion, and row-structured reordering as essential to stable high throughput under real-world sparsity distributions (Li et al., 24 Feb 2026, Li et al., 2024).
Format Auto-Tuning: ME-BCRS, GCOO, and hybrid CSR/BCSR partition require empirical profiling or online adaptation to maximize tile density and SIMD/TCU utilization per matrix (Shi et al., 2024, Lei et al., 11 Nov 2025, Shi et al., 2020).
Pipeline and Prefetch Tuning: Fine-grained pipelining (double buffering, staged prefetches) is critical to saturate DRAM and on-chip memory bandwidth, especially on GPUs/SME (Ma et al., 3 Mar 2025, Shi et al., 2024, Li et al., 24 Feb 2026).
Extending Beyond SpMM: FusedMM and IOHP kernels extend ideas across SDDMM, sparse-sparse MM, and higher-order contractions, exploiting similar dataflow, accumulator, and reduction logic (Sun et al., 2023, Rahman et al., 2020, Shi et al., 2024).
Energy Efficiency and Cross-Platform Comparison: Recent work reports a GFLOPS/W advantage for SME/CPU frameworks over GPUs on selected workloads (Lei et al., 11 Nov 2025).
Out-of-Core and Distributed SpMM: Semi-external strategies, distributed-memory RDMA kernels, and adaptive communication-eliding synthesis are essential for scaling to billion-edge graphs and multi-node ML systems (Zheng et al., 2016, Brock et al., 2023, Bharadwaj et al., 2022).

Best practice (across hardware): combine blocking, coalesced memory access, mapping onto SIMD/TCU/tile units, and row/column partitioning adapted to the matrix, guided by operational intensity, available bandwidth, and (increasingly) data-dependent dynamic scheduling, for practical, robust SpMM performance.

Key references:

(Buluç, 6 Aug 2025) "The Ubiquitous Sparse Matrix-Matrix Products"
(Shi et al., 2024) "FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores"
(Sun et al., 2023) "IOPS: An Unified SpMM Accelerator Based on Inner-Outer-Hybrid Product"
(Li et al., 24 Feb 2026) "RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs"
(Lei et al., 11 Nov 2025) "LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures"
(Xiang et al., 8 Apr 2025) "cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores"
(Huang et al., 2020) "GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks"
(Song et al., 2021) "Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication"
(Zheng et al., 2016) "Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs"
(Fu et al., 2023) "JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication"
(Rahman et al., 2020) "FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks"
(Li et al., 2024) "HC-SpMM: Accelerating Sparse Matrix-Matrix Multiplication for Graphs with Hybrid GPU Cores"
(Nagasaka et al., 2019) "Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks"
(Bharadwaj et al., 2022) "Distributed-Memory Sparse Kernels for Machine Learning"
(Brock et al., 2023) "RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs"
(Ma et al., 3 Mar 2025) "NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU"
(Shi et al., 2020) "Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format"
(Shah et al., 30 Apr 2026) "Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3"