Sparse Matrix-Vector Multiplication on GPUs

Updated 10 May 2026

SpMV on GPUs is a computational kernel defined by irregular memory accesses and low arithmetic intensity, crucial for scientific computing and graph analytics.
It leverages specialized sparse storage formats and warp-oriented layouts, such as CSR5 and ELL-WARP, to maximize memory coalescing and balance loads.
Adaptive GPU kernel strategies incorporate auto-tuning and ML-assisted optimization, achieving significant speedups and energy efficiency improvements.

Sparse matrix-vector multiplication (SpMV) on GPUs is a high-impact computational kernel in scientific computing, graph analytics, and iterative solvers. It presents unique algorithmic and architectural challenges due to irregular memory access, high memory bandwidth requirements, load imbalance, and low arithmetic intensity. Contemporary research addresses these constraints with specialized storage formats, advanced GPU-aware partitioning and scheduling, hybrid approaches, auto-tuning, and learning-based optimization frameworks, in order to realize efficient, portable, and scalable SpMV across the breadth of sparse matrix types and hardware generations.

1. Architectural Characteristics and Memory-Bound Nature

SpMV on GPUs is distinctly constrained by memory bandwidth rather than peak floating-point throughput due to the following factors:

Irregular Memory Access: The essential operation, $y_i = \sum_j A_{ij} x_j$ , requires random accesses to the $x_j$ vector, which destroys memory coalescing and cache line utilization (Wong et al., 2015, Grossman et al., 2016, Mpakos et al., 2023, Mohammed et al., 2022).
Low Arithmetic Intensity: Each nonzero triggers only two floating-point operations (multiply-add), yielding $\approx 0.17$ flop/byte in double precision—far below the ratios of the hardware (Gao et al., 2024, Wong et al., 2015).
Thread and Warp Imbalance: Variability in row nonzeros leads to divergent execution and underutilization within GPU warps and SMs.
Memory Hierarchy: GPU global (“HBM”) memory typically offers 400–1500 GB/s, orders of magnitude lower latency than registers or shared memory, but irregular access patterns prevent good cache occupancy and hit rates (Mpakos et al., 2023).

For instance, the effective bandwidth and throughput on large A100 devices are bounded by $B_{\rm eff} ≲ 225$ GFLOP/s due to memory (Mpakos et al., 2023). Empirical peak bandwidth-utilization of highly optimized SpMV kernels on recent V100/A100 GPUs achieves only 50–80% of these hardware rooflines (Mpakos et al., 2023, Mohammed et al., 2022, Chen, 2022).

2. Sparse Storage Formats and Layouts for GPUs

GPU performance is dominated by how well the matrix format enables memory coalescing, load balance, and minimized zero/metadata overhead (Mohammed et al., 2022, Wong et al., 2015, Grossman et al., 2016, Gao et al., 2024). Key formats include:

Format	Structure	GPU Pros/Cons
CSR	val[nnz], col[nnz], ptr[n+1]	Minimal storage, poor for coalesced x lookups and long rows
COO	row[nnz], col[nnz], val[nnz]	Fine-grain parallelism, but y reduction needs atomics/serialization
ELLPACK	val[m×K], col[m×K], zero-padded	Fully coalesced loads, but high padding for irregular matrices
HYB	ELL (dense prefix per row) + COO tail	Amortizes ELL’s padding; requires dual kernel logic
SELL-C-σ	Sliced ELL with sorting	Trades padding vs. sorting; (K) rows per slice, limited zero-fill and improved coalescence (Mohammed et al., 2022)
CSR5	2D tile extension of CSR	Uniform load across tiles/warps, extra metadata, best for irregular matrices (Mpakos et al., 2023, Mohammed et al., 2022)
pJDS	Padded jagged diagonals, per-warp	Substantially reduces padding, coalesces loads, effective on high row-length variance (Kreutzer et al., 2011)
ELL-WARP	Warp-oriented ELL hybrid	Row sorting and warp-aligned chunking limit padding, achieve high effective bandwidth (Wong et al., 2015)
Blocked (BCSR/BELLPACK)	Dense subblocks	Enhances data reuse and compute intensity for block-structured matrices (Gao et al., 2024)

Canonical warp-oriented layouts (SELL-C-σ, ELL-WARP, pJDS) are necessary to achieve coalescence and restrict padding to block-local maxima, rather than to worst-case global maxima (Wong et al., 2015, Mohammed et al., 2022, Kreutzer et al., 2011). Specialized schemes such as EHYB explicitly partition and shard $x$ into shared memory, cutting $x$ -fetch traffic by 2× and further boosting arithmetic intensity (Chen, 2022).

3. GPU Kernel Design, Load Balancing, and Partitioning

High-efficiency SpMV on GPUs requires grid/thread/warp/block mapping that matches the storage layout (Wong et al., 2015, Mohammed et al., 2022, Kreutzer et al., 2011). Major principles:

Row/warp clustering: In SELL-C-σ, pJDS, and ELL-WARP, rows sorted by nonzero count are grouped into warp-sized blocks or slices, limiting zero-fill to local maxima and keeping all warps fully occupied.
Coalescing: To exploit DRAM bandwidth, data and index arrays must be in column-major order per warp/slice to realize 128-byte aligned, coalesced memory transactions (Wong et al., 2015).
Multi-thread per row: For long or irregular rows, it is optimal to allocate an entire warp or block to one row (as in ELL-WARP K2 or CSR-vector2), performing reductions in shared memory (Wong et al., 2015, Mohammed et al., 2022).
Explicit x Caching: EHYB (Chen, 2022) loads partition-local chunks of $x$ into shared memory, greatly reducing remote fetches; index compression (packing to 16-bit) reduces further bandwidth.
Partitioned/Hashed Block Scheduling: Hash-partitioned and 2D-blocked approaches (HBP (Yan et al., 11 Apr 2025)) match block/warp assignment to nonzero distributions, improving divergence and throughput without global sorts.
Hybridization and Tail Handling: Outlier long rows (“tails”) are handled separately in an atomic/COO or small CSR kernel, with the bulk processed efficiently by regular, fast paths (Kreutzer et al., 2011, Wong et al., 2015).
Multi-GPU SpMV: Fine-grain “equal-nnz” partitioning (pCSR, pCSC in MSREP (Chen et al., 2022)) is essential for scalable strong/weak scaling; communication and local x/y vector replication is needed to avoid bottlenecks.

4. Auto-Tuning and Machine Learning–Assisted Optimization

Modern frameworks systematically optimize SpMV for both latency and energy via two-stage auto-tuning:

Compile-Time Tuning: Choose thread-block size, register budget, and memory hierarchy configurations (e.g., L1/shared split), using feature vectors (row count, nnz, nnz statistics) and classifiers such as Decision Tree, Random Forest, MLP (Auto-SpMV (Ashoury et al., 2023, Mohammed et al., 2022)).
Run-Time Format Selection: For a given sparse matrix, extract eight or more sparsity features, then invoke trained ML models (tree, SVM, ensemble) to select optimal format and kernel (CSR, ELL, BELL, SELL) for each objective (minimum latency $T$ , minimum energy $E$ , power $P$ , energy efficiency $x_j$ 0) (Ashoury et al., 2023, Gao et al., 2024).
Online Kernel Selection: Adaptive frameworks for SpMV/SpMSpV maintain multiple kernel implementations for different vector sparsity/row irregularity regimes, with decision-tree–based dynamic selection at each solve or iteration (Li et al., 2020).
Performance Impact: Up to 52% reduction in latency/energy, 99.7% energy-efficiency improvement, and 3–7× speedup over static-kernel libraries have been demonstrated across recent benchmark suites (Ashoury et al., 2023, Li et al., 2020).
Practical considerations: Feature extraction and ML inference overhead must be kept sub-2% of end-to-end runtime for these ML-assisted pipelines to maintain net benefit.

5. Empirical Performance and Comparative Evaluation

Extensive surveys and head-to-head benchmarks consistently validate that SpMV is highly memory-bound and format-sensitive:

Vendor Libraries: CSR (CUSPARSE) remains baseline for low/regular row count, but advanced formats (CSR5, SELL-C-σ, EHYB, ELL-WARP) achieve 20–50% higher bandwidth and throughput for irregular/unstructured problems (Wong et al., 2015, Mpakos et al., 2023, Chen, 2022).
Performance Range: Optimized SpMV implementations achieve 50–160 GFLOP/s on P100/V100/A100 GPUs for “generic” matrices, 140–300 GFLOP/s on highly regular or well-sliced inputs, but can fall to 30–60 GFLOP/s if either the average nnz/row is small ( $x_j$ 1) or access patterns are highly irregular (Mpakos et al., 2023, Mohammed et al., 2022).
Multi-GPU Scaling: MSREP achieves nearly linear scaling up to 8 A100 GPUs, exceeding 6× speedup, using only lightweight pointer slices and equal-nnz partitioning (Chen et al., 2022).
Preprocessing Overheads: Some techniques (e.g., entropy-based maximizing, EHYB’s graph partition) incur significant one-time costs, only justified when the SpMV kernel is amortized over many right-hand sides or iterations (D'Alberto et al., 2023, Chen, 2022).
Specialization Wins: ELL-WARP and CSR-k outperform general-purpose CSR5 and vendor implementations on PDE-style regular meshes, while their advantage disappears on “power-law” graphs or highly variable row-length distributions (Wong et al., 2015, Lane et al., 2022, Kreutzer et al., 2011).
Hash-based Reordering: Novel nonlinear hash block methods (HBP) deliver 1.6–3× speedup and 3–7× faster preprocessing than classic 2D-tiling or sort+DP schemes, for a range of real sparse matrices on modern GPUs (Yan et al., 11 Apr 2025).

6. Application Domains and Specialized Algorithms

SpMV is central to:

Finite Element and Finite Difference Solvers: Direct assembly and iterative CG/GMRES solvers routinely invoke SpMV on GPU. On heart modeling meshes ELL-WARP K2 achieves up to 12× speedup over single-core CPU PETSc (block Jacobi), and 5–6× over 4-core PETSc (Wong et al., 2015).
Graph Analytics: For graphs with power-law degree, composite tile/partitioned SpMV kernels that exploit heavy-tail structure yield 1.8–2.1× higher GFLOP/s and 18–32× end-to-end speedup over multicore CPU (Yang et al., 2011).
Block-Kernel and Multivector SpMV: By keeping multiple right-hand sides on-GPU and arranging block $x_j$ 2/ $x_j$ 3 vectors in column-major order, bandwidth and coalescing are improved; essential in block-Krylov, block-Wiedemann algorithms (Boyer et al., 2010).
Parallel Rank Solvers: Custom SpMV for field/ring arithmetic in LinBox, with modular/±1 extraction and delayed reduction, supports 3–5× GPU acceleration for black-box rank computations (Boyer et al., 2010).

7. Best Practices, Open Problems, and Emerging Directions

Key recommended practices—derived from across surveyed works—include (Mohammed et al., 2022, Wong et al., 2015, Chen, 2022, Mpakos et al., 2023):

Format selection: Dynamically choose among CSR5, SELL-C-σ, ELL-WARP, or EHYB based on measured nnz/row, row variance (skew), and irregularity metrics.
Memory layout: Always store val/col in column-major within each warp/slice for coalesced fetches; minimize global zero-padding and localize to slices (SELL-C-σ).
Adaptive kernel tuning: Pre-permute x/col vectors to maximize store/load coalescence; tune blocks and thread allocation to $x_j$ 4 for best occupancy; leverage texture/read-only cache for $x_j$ 5 when access pattern is semi-random.
Amortize reordering cost: Fuse permutation with other assembly kernels or amortize over $x_j$ 6 SpMV calls characteristic of iterative solvers.
Heterogeneous/Multi-GPU: Adopt equal-nnz “p” formats (MSREP), replicate vectors locally, and overlap result merging and communication for scalable performance.
Auto-tuning and ML selection: Integrate feature extraction, learned selection, and fast kernel switching for mixed workloads; maintain <10% net overhead.
Energy efficiency: Employ auto-tuned launch parameters and format selection (Auto-SpMV) to cut energy/power by up to 2×, especially for low-power or mobile deployments (Ashoury et al., 2023).

Outstanding challenges remain in:

Balancing preprocessing cost versus kernel speedup for dynamic, streaming, or time-varying sparsity structures.
Supporting “extreme” irregularity or mixed-precision requirements without excessive format conversion/overhead.
Generalizing efficient SpMV frameworks to multi-GPU, distributed-memory, or heterogeneous clouds with one matrix partitioned over CPUs, GPUs, FPGAs, or both.
AI-guided adaptive format synthesis and online format/kernels switching with negligible runtime overhead.
Standardizing performance metrics beyond GFLOP/s, including effective bandwidth, power/play ratio, selection latency, and memory footprint.

The state-of-the-art in SpMV on GPUs is thus defined by dynamic format selection, warp-sliced layout, explicit memory hierarchy exploitation, and end-to-end auto-tuning pipelines that can adapt to the full heterogeneity of contemporary sparse workloads (Wong et al., 2015, Ashoury et al., 2023, Mohammed et al., 2022, Lane et al., 2022, Yan et al., 11 Apr 2025, Grossman et al., 2016).