
Compressed Sparse Row Matrix

Updated 1 March 2026
  • A Compressed Sparse Row (CSR) matrix is a data structure that uses three arrays to represent a sparse matrix with high memory efficiency and rapid row access.
  • It is widely used in scientific computing and machine learning to perform fast sparse matrix-vector multiplications and support hardware-accelerated streaming.
  • Advanced variants such as DA-CSR and CSR5 optimize memory access and parallel processing, delivering significant speedups and reduced storage overhead.

The Compressed Sparse Row (CSR) matrix format is a canonical data structure for representing general sparse matrices in an efficient, compact, and computationally convenient manner. It is foundational to high-performance scientific computing, machine learning, and sparse linear algebra, supporting fast arithmetic, predictable memory access patterns, and cross-platform compatibility. CSR serves as the default format in many libraries, hardware accelerators, and algorithmic frameworks.

1. Formal Structure and Index Mapping

Let $A$ be an $m \times n$ sparse matrix with $\text{nnz}$ nonzero entries. In the CSR format, $A$ is defined by three 1-D arrays:

  • $\texttt{row\_ptr}[0 \dots m]$, of length $m+1$
  • $\texttt{col\_idx}[0 \dots \text{nnz}-1]$, of length $\text{nnz}$
  • $\texttt{val}[0 \dots \text{nnz}-1]$, of length $\text{nnz}$

For row $i$, the contiguous segment $\texttt{col\_idx}[\texttt{row\_ptr}[i] \dots \texttt{row\_ptr}[i+1]-1]$ contains the column indices $j$ of the nonzeros, and $\texttt{val}[k]$ contains the corresponding values $A_{i,j}$. Thus,

$\text{nnz} = |\{(i,j) \mid A_{i,j} \neq 0\}|$

To retrieve any $A_{i,j}$, search $\texttt{col\_idx}[\texttt{row\_ptr}[i] \dots \texttt{row\_ptr}[i+1]-1]$ for $j$. If $j$ is found at position $k$, then $A_{i,j} = \texttt{val}[k]$; otherwise $A_{i,j} = 0$ (Scheffler et al., 2023; Yang et al., 2018).

This ensures $O(m + \text{nnz})$ storage, with access to an entire row in $O(\text{row length})$ and access to an individual entry in at worst $O(\text{row length})$ if column indices are unsorted.
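As a concrete instance of this index mapping, here is a minimal pure-Python sketch (array contents and the `get` helper are illustrative, not drawn from any cited library):

```python
# CSR representation of the 3x4 matrix
#   [[5, 0, 0, 2],
#    [0, 0, 3, 0],
#    [1, 0, 0, 4]]
row_ptr = [0, 2, 3, 5]              # length m + 1 = 4
col_idx = [0, 3, 2, 0, 3]           # length nnz = 5
val     = [5, 2, 3, 1, 4]           # length nnz = 5

def get(i, j):
    """Return A[i, j] by scanning row i's contiguous segment."""
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_idx[k] == j:
            return val[k]
    return 0                        # j not present in row i => zero entry

print(get(0, 3))  # -> 2
print(get(1, 0))  # -> 0 (implicit zero)
```

Note that row 1's segment is `col_idx[2:3]`, a single nonzero, exactly as `row_ptr[1] = 2` and `row_ptr[2] = 3` dictate.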

2. Sparse Matrix-Vector Multiplication (SpMV) Algorithm and Complexity

The classical CSR-based SpMV computes $y = Ax$ as follows:

for i in 0 … m-1:
    sum ← 0.0
    for k in row_ptr[i] … row_ptr[i+1]-1:
        j ← col_idx[k]
        sum += val[k] * x[j]
    y[i] ← sum
Formally,

$\forall\,i:\ y_i = \sum_{k=\texttt{row\_ptr}[i]}^{\texttt{row\_ptr}[i+1]-1} \texttt{val}[k] \cdot x_{\texttt{col\_idx}[k]}$

Each nonzero performs one multiply-add, with three memory loads ($\texttt{val}$, $\texttt{col\_idx}$, and an indirect access to $x$), plus two $\texttt{row\_ptr}$ loads per row (Scheffler et al., 2023; Liu et al., 2015). The algorithm performs $O(\text{nnz})$ floating-point operations and $O(\text{nnz})$ memory accesses (dominated by the matrix and vector terms), which makes CSR extremely efficient for row-oriented sparse computations.
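The pseudocode above translates directly into runnable Python (a pure-Python rendering for clarity; a production kernel would of course be vectorized or compiled):

```python
def spmv(row_ptr, col_idx, val, x, m):
    """y = A @ x for a CSR matrix A with m rows."""
    y = [0.0] * m
    for i in range(m):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += val[k] * x[col_idx[k]]   # one multiply-add per nonzero
        y[i] = s
    return y

# Same 3x4 example matrix as in Section 1:
#   [[5, 0, 0, 2], [0, 0, 3, 0], [1, 0, 0, 4]]
y = spmv([0, 2, 3, 5], [0, 3, 2, 0, 3],
         [5.0, 2.0, 3.0, 1.0, 4.0], [1.0, 2.0, 3.0, 4.0], 3)
print(y)  # -> [13.0, 9.0, 17.0]
```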

3. Architectural and Algorithmic Enhancements

3.1 Hardware-Accelerated Streaming

Sparse Stream Semantic Registers (SSSR) eliminate instruction overhead in CSR SpMV by configuring hardware streams for the $\texttt{val}$, $\texttt{col\_idx}$, and $x$ accesses. Registers act as stream endpoints: each hardware load triggers the next CSR element, enabling back-to-back FMA instructions. On RISC-V, this produces up to 80% FPU utilization and 5–7× speedups over baseline in-order implementations (Scheffler et al., 2023).

Parallel algorithms leverage the contiguous storage of row data to maximize instruction- and thread-level parallelism, coalesced memory accesses, and effective load balancing across architectures, particularly on GPU and multicore CPUs.

3.2 Memory Access Optimization

On GPUs, row-major arrangement in CSR arrays and merge-based load balancing assign each thread block or warp a contiguous chunk of the nonzero index space, reducing memory transaction count by up to 32×. This eliminates row-length-induced load imbalance and aligns with hardware coalescing footprints (Yang et al., 2018).
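The core idea of nonzero-centric load balancing can be sketched without GPU code: split the nnz index space into equal chunks and locate each chunk's starting row by binary search on `row_ptr`. This is a simplified stand-in for the full merge-path scheme of Yang et al. (2018); the function name and return shape are illustrative:

```python
import bisect

def partition_rows(row_ptr, num_workers):
    """For each worker, return (start_nnz, end_nnz, start_row): an equal
    slice of the nonzero index space plus the row containing its first
    nonzero, so each worker handles ~nnz/num_workers entries."""
    nnz = row_ptr[-1]
    parts = []
    for w in range(num_workers):
        start = w * nnz // num_workers
        end = (w + 1) * nnz // num_workers
        # greatest row i with row_ptr[i] <= start owns nonzero `start`
        row = bisect.bisect_right(row_ptr, start) - 1
        parts.append((start, end, row))
    return parts

# One very long row plus short rows: both workers share the long row,
# instead of one worker taking all 8 of its nonzeros.
print(partition_rows([0, 8, 8, 9, 12], 2))  # -> [(0, 6, 0), (6, 12, 0)]
```

Row-per-thread scheduling would assign the 8-nonzero row to a single worker; the even nnz split is what removes the imbalance.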

3.3 Storage-Reduced Modifications

Diagonal Addressing (DA-CSR) stores column indices as signed 16-bit offsets $d = c - r$, leveraging the low matrix bandwidth obtained after reordering via, e.g., Reverse Cuthill–McKee. This reduces memory traffic by 17–25% in memory-bound applications and yields commensurate performance gains. For matrices with bandwidth $B < 2^{15}$, indices fit in 2 bytes with unchanged semantics, which applies to over 95% of tested SuiteSparse matrices (Saak et al., 2023).
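The DA-CSR index transformation can be sketched in a few lines, assuming the matrix has already been reordered so that every offset fits in a signed 16-bit integer (function name illustrative):

```python
def to_diagonal_offsets(row_ptr, col_idx):
    """Replace each absolute column index c with the signed diagonal
    offset d = c - r, which is storable in 16 bits whenever the matrix
    bandwidth B < 2**15."""
    offsets = []
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            d = col_idx[k] - r
            if not -(2**15) <= d < 2**15:
                raise ValueError("bandwidth too large for int16 offsets")
            offsets.append(d)
    return offsets

# Same 3x4 example matrix as in Section 1:
print(to_diagonal_offsets([0, 2, 3, 5], [0, 3, 2, 0, 3]))
# -> [0, 3, 1, -2, 1]
```

Recovering the original index is just `c = d + r`, so SpMV needs only one extra add per nonzero in exchange for halved index traffic.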

CSR5 introduces lightweight tiling and segment descriptors, augmenting CSR with tiling metadata. Each tile's entries are stored in column-major order with bit flags marking segment heads. This approach, with roughly 2% extra storage overhead, achieves up to 1.18× speedup on CPUs and up to 6.4× on GPUs for irregular problems, while retaining a low conversion cost (about 2–4 SpMV times on GPU, 10–20 on CPU/Xeon Phi) (Liu et al., 2015).
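The segmented-sum idea behind CSR5 can be illustrated in miniature: derive per-nonzero "segment head" flags from `row_ptr`, then reduce the flat product array by segments. This toy sketch keeps only the flag mechanism and omits the actual $\omega \times \sigma$ tiling and column-major tile layout:

```python
def spmv_segmented(row_ptr, col_idx, val, x):
    """SpMV as (1) a flat, perfectly parallel product pass over all
    nonzeros, followed by (2) a segmented sum driven by bit flags."""
    nnz = row_ptr[-1]
    prod = [val[k] * x[col_idx[k]] for k in range(nnz)]
    # segment-head flags: True where a non-empty row's nonzeros begin
    flags = [False] * nnz
    seg_row = []                     # row index owning each segment
    for r in range(len(row_ptr) - 1):
        if row_ptr[r] < row_ptr[r + 1]:
            flags[row_ptr[r]] = True
            seg_row.append(r)
    # segmented sum over the flags
    y = [0.0] * (len(row_ptr) - 1)
    seg = -1
    for k in range(nnz):
        if flags[k]:
            seg += 1
        y[seg_row[seg]] += prod[k]
    return y

# 3x3 matrix [[1, 0, 2], [0, 0, 0], [0, 3, 0]] with an empty middle row
print(spmv_segmented([0, 2, 2, 3], [0, 2, 1],
                     [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# -> [7.0, 0.0, 6.0]
```

The product pass has no row structure at all, which is what lets CSR5 process tiles of irregular rows with uniform parallel work.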

4. Extensions and Algorithmic Uses

CSR’s structure facilitates not only basic matrix–vector and matrix–matrix products, but also more complex transformations:

  • Polynomial Feature Expansion: CSR can be operated on directly for degree-$K$ expansions by leveraging closed-form bijections based on $K$-simplex numbers. The mapping ensures direct computation and indexing of expanded features:

$T_K(n) = \binom{n+K-1}{K}$

For input dimensionality $D$ and density $d$, the time complexity becomes $\Theta(d^K D^K)$, yielding up to a $d^K$ speedup over dense expansion; exact allocation is possible via a pre-count pass followed by nonzero enumeration (Nystrom et al., 2018).
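The simplex-number count can be checked directly with Python's `math.comb`; this is just the closed form above, not the papers' full index-bijection code:

```python
import math

def T(K, n):
    """K-simplex number: the count of degree-K monomials (with
    repetition) over n variables, binom(n + K - 1, K)."""
    return math.comb(n + K - 1, K)

# Degree-2 expansion of n features has n(n+1)/2 terms (products x_i*x_j
# with i <= j), matching the triangular-number special case:
print(T(2, 4))  # -> 10, same as 4*5/2
print(T(1, 7))  # -> 7, degree-1 "expansion" is the features themselves
```

A pre-count pass over a CSR row with $r$ nonzeros allocates exactly $T_K(r)$ output slots, which is what makes exact allocation possible.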

  • Common Subexpression Elimination (CSE): When matrix elements are drawn from a small weight alphabet and patterns repeat across columns, a random search algorithm can extract two-term common subexpressions, storing them as adder trees alongside a pruned CSR matrix. This reduces both memory footprint (by over 50% at $\alpha = 0.25$, $U = 2$) and runtime (by up to 20% for small $U$), with each CSE node reused across $z_\ell$ rows (Bilgili et al., 2023).
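The flavor of the two-term pattern search can be conveyed with a simple frequency count of (column, weight) pairs co-occurring within rows; this is a naive exhaustive stand-in for the randomized search of Bilgili et al. (2023), and all names are illustrative:

```python
from collections import Counter
from itertools import combinations

def count_two_term_patterns(rows):
    """rows: per-row lists of (col, weight) terms of a CSR matrix.
    Count how often each unordered two-term pattern recurs across rows;
    frequent patterns are candidates for a shared adder-tree node."""
    counts = Counter()
    for row in rows:
        for a, b in combinations(sorted(row), 2):
            counts[(a, b)] += 1
    return counts

rows = [[(0, 1), (2, -1)],
        [(0, 1), (2, -1), (3, 1)],
        [(1, 1), (3, 1)]]
counts = count_two_term_patterns(rows)
# {(0, 1), (2, -1)} appears in two rows, so the subexpression
# x[0] - x[2] could be computed once and reused by both.
```

A real implementation would then rewrite the CSR matrix to reference the shared node, pruning the duplicated terms.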

5. Performance, Platform Considerations, and Limitations

CSR’s row-oriented design aligns with high-performance computing memory hierarchies, but performance is sensitive to:

  • Row Length Variability: Highly irregular row distributions degrade SIMD/SIMT utilization under standard CSR, motivating hybrid or tiled variants such as CSR5.
  • Memory-Boundedness: For large matrices exceeding cache capacity, memory traffic is the dominant performance limiter. Strategies like DA-CSR that halve the index size (from 32 to 16 bits) translate the index-traffic savings directly into SpMV speedups of 17–20% (Saak et al., 2023).
  • Hardware Parallelism: Performance scaling on CPUs and GPUs depends on exploiting parallel streams, coalesced loads, and efficient reduction of partial sums. Merge-based and warp-centric approaches in GPU SpMM maximize both bandwidth use and computational occupancy (Yang et al., 2018).
  • Format Conversion Overheads: Advanced variants (e.g., CSR5) keep setup costs low; the one-time conversion is typically amortized after tens of SpMV iterations in iterative solvers (Liu et al., 2015).
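The memory-boundedness point can be made concrete with a back-of-the-envelope traffic model built only from the array sizes (a rough lower bound that pessimistically assumes no cache reuse of $x$; the function and parameter names are illustrative):

```python
def spmv_traffic_bytes(m, nnz, val_bytes=8, idx_bytes=4, ptr_bytes=4):
    """Approximate bytes streamed by one CSR SpMV: read val, col_idx,
    row_ptr once, one indirect x load per nonzero, one y write per row."""
    return (nnz * (val_bytes + idx_bytes)   # val and col_idx streams
            + (m + 1) * ptr_bytes           # row_ptr
            + nnz * val_bytes               # indirect loads of x
            + m * val_bytes)                # write of y

m, nnz = 10**6, 10**7
std = spmv_traffic_bytes(m, nnz)                  # 32-bit column indices
da = spmv_traffic_bytes(m, nnz, idx_bytes=2)      # 16-bit DA-CSR offsets
# Halving the index width removes exactly nnz * 2 bytes of traffic,
# which for a memory-bound kernel converts directly into runtime.
```

Under this model the index stream is roughly 19% of total traffic at these sizes, consistent in magnitude with the reported 17–20% speedups.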

6. Comparative Table: Variants and Platform Suitability

Variant | Key Structural Modification | Platform Benefit
--- | --- | ---
Standard CSR | Three arrays: row_ptr, col_idx, val | Wide support; fast on regular matrices (Scheffler et al., 2023)
DA-CSR | Signed 16-bit diagonal offsets | 17–25% traffic reduction for $B < 2^{15}$; memory-bound workloads (Saak et al., 2023)
CSR5 | Tiling ($\omega \times \sigma$) with tile descriptors | Irregular workloads on GPU/CPU/Xeon Phi; up to 6.4× speedup (Liu et al., 2015)
CSR + CSE | Common-subexpression adder trees, weight factoring | Quantized/pruned DL models; >50% storage and up to 20% runtime reduction (Bilgili et al., 2023)

CSR remains dominant due to its compactness, compatibility, and predictable memory access patterns. Platform-specific variants address irregular sparsity, bandwidth limitations, or recurring value patterns while typically preserving the foundational row-compressed indexing and streaming semantics.

7. Research Directions and Broader Impact

Ongoing research explores:

  • Hardware-software co-design for CSR and derived formats to maximize in-core FPU utilization via streaming, hardware-controlled indirection, and minimal memory overhead (Scheffler et al., 2023).
  • Adaptation of CSR to domain-specific requirements: e.g., low-precision architectures, graph pattern matching, and PDE solvers, along with efficient format conversion pipelines and online reordering.
  • Integration of algebraic transformations (such as CSE) for pruned, quantized models in deep learning inference, targeting edge devices with extreme resource constraints (Bilgili et al., 2023).
  • Unified cross-platform data structures that maintain high throughput across CPUs, GPUs, and vector accelerators, especially for mixed-sparsity workloads (Liu et al., 2015).

A plausible implication is that as sparse computation moves deeper into hardware, the logical structure and amenability to streaming of CSR-derived representations will continue to serve as the architectural baseline for both research and deployment.
