
Out-of-Core Linear Algebra

Updated 7 May 2026
  • Out-of-core linear algebra is a set of algorithms that manage data movement between limited fast memory and extensive slow storage, enabling large-scale matrix computations.
  • These methods leverage communication optimality and blocking strategies to minimize I/O overhead, achieving near theoretical lower bounds in data transfer.
  • They are applied in scientific computing, data science, and GPU-accelerated environments to process matrices that exceed in-core memory capacities.

Out-of-core linear algebra encompasses algorithms and computational frameworks that enable the efficient solution of linear algebra problems when the working data greatly exceeds the available fast memory (RAM or device memory). This paradigm requires explicit optimization of data movement (communication) between a limited fast memory and (potentially unbounded) slow memory, such as hard drives or SSDs. Key advances in this field include optimal communication lower bounds for classical matrix factorizations, randomized block algorithms for singular value and rank-revealing decompositions, and general runtime strategies that minimize wall-clock times even as matrices scale far beyond in-core capacity.

1. Two-Level Memory Model and Communication Cost

Central to out-of-core linear algebra is the two-level memory model: computation is performed in fast memory (size $S$ words), while the dataset resides on an unbounded slow memory medium. Data not in fast memory must be explicitly transferred ("read" or "written"), and the total communication volume $Q$ (the sum of all words moved) becomes a primary performance constraint (Beaumont et al., 2022, Heavner et al., 2020). The arithmetic intensity $\rho$ (ratio of arithmetic operations to data movement) is dictated by the chosen algorithm's schedule, and the minimum attainable $Q$ is a limiting factor for large-scale problems.

A generic lower bound on communication for fixed-work algorithms is derived as follows. If a subcomputation accessing at most $X$ distinct data items can perform at most $H_{max}$ arithmetic operations, then

$$\rho \leq \frac{H_{max}}{X-S} \qquad \implies \qquad Q \geq \frac{V}{\rho} \geq \frac{V(X-S)}{H_{max}}$$

where $V$ is the total number of arithmetic operations (Beaumont et al., 2022).
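The bound can be evaluated numerically. The sketch below implements the two inequalities directly and, as an illustration, plugs in classical matrix multiplication. The function names and the choice $H_{max} \leq (X/3)^{3/2}$ (a Loomis-Whitney style estimate for classical matrix multiplication) are assumptions made for this example and are not taken from the cited papers.

```python
import math

def intensity_bound(x: int, s: int, h_max: float) -> float:
    """Upper bound on arithmetic intensity: rho <= H_max / (X - S)."""
    if x <= s:
        raise ValueError("X must exceed S for the bound to be meaningful")
    return h_max / (x - s)

def io_lower_bound(v: float, x: int, s: int, h_max: float) -> float:
    """Lower bound on communication volume: Q >= V * (X - S) / H_max."""
    return v / intensity_bound(x, s, h_max)

def gemm_h_max(x: int) -> float:
    # Assumption for illustration (not from the text): with X words spread
    # over the three operands of a classical matrix product, at most
    # (X/3)^(3/2) multiply-adds are possible.
    return (x / 3) ** 1.5

S = 1024        # fast memory size in words
X = 3 * S       # under the assumed H_max, X = 3S gives the tightest bound
N = 4096
V = N ** 3      # multiply-add count of an N x N x N product

rho = intensity_bound(X, S, gemm_h_max(X))
Q = io_lower_bound(V, X, S, gemm_h_max(X))
print(f"rho <= {rho:.1f}  (sqrt(S)/2 = {math.sqrt(S) / 2:.1f})")
print(f"Q   >= {Q:.3e} words  (2 N^3 / sqrt(S) = {2 * V / math.sqrt(S):.3e})")
```

Under these assumptions the intensity is capped at $\sqrt{S}/2$, so the transfer volume scales as $V/\sqrt{S}$: quadrupling fast memory only halves the minimum I/O.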

2. Communication-Optimal Algorithms for Symmetric Kernels

Fundamental symmetric linear algebra operations, including the symmetric rank-$k$ update (SYRK) and Cholesky factorization, exhibit structural redundancy that can be exploited for higher arithmetic intensity, and thus reduced data movement, compared to conventional non-symmetric kernels.

For SYRK (computing $C \leftarrow C + AA^T$ for $A \in \mathbb{R}^{N \times M}$), the optimal communication lower bound is

$$Q \geq \frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}}$$

For Cholesky factorization of $N \times N$ symmetric positive definite matrices,

$$Q \geq \frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}}$$

Both bounds improve on those for non-symmetric kernels (e.g., GEMM, LU) by a factor of $\sqrt{2}$, reflecting the doubled utility of each matrix element in symmetric updates (Beaumont et al., 2022).

Matching algorithms with optimal $Q$ include:

  • Tiled Block SYRK (TBS): Block decomposition with recursive processing of "triangle" submatrices, enabling communication volume $\frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}}$ to leading order for an $N \times M$ input.
  • Levelwise Block Cholesky (LBC): Combines OOC Cholesky on diagonal blocks with panel updates via TBS, yielding $\frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}}$ to leading order for an $N \times N$ matrix.

These approaches establish that, for symmetric kernels, the lower bounds are tight in their leading term, demonstrating an intrinsic $\sqrt{2}\times$ higher operational intensity relative to non-symmetric counterparts (Beaumont et al., 2022).
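To get a feel for the magnitudes involved, the following sketch evaluates the leading-order bounds, taking them to be $\frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}}$ for SYRK and $\frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}}$ for Cholesky as reported by Beaumont et al. (2022). The function names, the example sizes, and the 8-byte word are assumptions for illustration; the non-symmetric reference is simply the symmetric bound scaled by the stated factor of $\sqrt{2}$.

```python
import math

def syrk_lower_bound(n: int, m: int, s: int) -> float:
    """Leading-order I/O lower bound (in words) for C <- C + A A^T, A of size n x m."""
    return n * n * m / (math.sqrt(2) * math.sqrt(s))

def cholesky_lower_bound(n: int, s: int) -> float:
    """Leading-order I/O lower bound (in words) for an n x n Cholesky factorization."""
    return n ** 3 / (3 * math.sqrt(2) * math.sqrt(s))

def nonsymmetric_reference(symmetric_bound: float) -> float:
    # The text states a sqrt(2) gap between symmetric and non-symmetric kernels.
    return math.sqrt(2) * symmetric_bound

WORD_BYTES = 8                     # assumed: double precision
S = (1 << 30) // WORD_BYTES        # assumed: 1 GiB of fast memory, in words
N, M = 200_000, 50_000             # assumed example sizes

for name, q in [("SYRK", syrk_lower_bound(N, M, S)),
                ("Cholesky", cholesky_lower_bound(N, S))]:
    sym_gib = q * WORD_BYTES / 2 ** 30
    ref_gib = nonsymmetric_reference(q) * WORD_BYTES / 2 ** 30
    print(f"{name}: >= {sym_gib:.1f} GiB moved (non-symmetric reference: {ref_gib:.1f} GiB)")
```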

3. Out-of-Core Singular Value and Rank-Revealing Factorizations

Out-of-core SVD, QRCP, and UTV factorizations address the challenge of computing rank structure and low-rank approximations when input matrices (dense or sparse) exceed memory constraints. The leading approaches use randomized block algorithms that combine bulk data movement, block-level computation, and overlap of I/O with numerical kernels (Lu et al., 2017, Demchik et al., 2019, Heavner et al., 2020).

Out-of-Core Randomized SVD (RSVD)

Block-wise RSVD algorithms partition $A$ (dense or sparse) across one or both indices so that every submatrix fits in memory. The main steps are:

  1. Sketching: $Y = A\Omega$ using a Gaussian random matrix $\Omega$.
  2. Blockwise Power Iterations: Repeated multiplications $Y \leftarrow A A^T Y$, each pass performed out-of-core.
  3. Out-of-Core Block QR: $Y$ is orthonormalized via block Gram-Schmidt or TSQR, giving an orthonormal basis $\hat{Q}$.
  4. Subspace Projection: $B = \hat{Q}^T A$ is formed blockwise and then a small (in-core) SVD is computed: $B = \tilde{U} \Sigma V^T$.
  5. Reconstruction: $U = \hat{Q} \tilde{U}$ assembled out-of-core.

I/O is performed asynchronously; checkpointing supports mid-computation resumption. The total I/O volume is ρ\rho8sizeρ\rho9 plus minor contributions from QQ0 (Demchik et al., 2019).
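The steps above can be sketched with NumPy and a memory-mapped file standing in for slow storage. This is a minimal illustration, not the implementation of Demchik et al. (2019): the function name `ooc_rsvd`, its parameters, and the row-block layout are assumptions, the tall sketch matrix is kept in RAM, an in-core `np.linalg.qr` stands in for blocked Gram-Schmidt or TSQR, and I/O is synchronous with no checkpointing.

```python
import os
import tempfile
import numpy as np

def ooc_rsvd(path, shape, k, p=8, q=1, rows_per_block=1024, seed=0):
    """Rank-k randomized SVD of a float64 matrix stored row-major at `path`."""
    m, n = shape
    A = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, k + p))          # Gaussian test matrix
    blocks = [(lo, min(lo + rows_per_block, m)) for lo in range(0, m, rows_per_block)]

    def apply_A(X):                                  # one pass over A: A @ X
        Y = np.empty((m, X.shape[1]))
        for lo, hi in blocks:
            Y[lo:hi] = A[lo:hi] @ X                  # only one row block is read at a time
        return Y

    def apply_At(Y):                                 # one pass over A: A.T @ Y
        Z = np.zeros((n, Y.shape[1]))
        for lo, hi in blocks:
            Z += A[lo:hi].T @ Y[lo:hi]
        return Z

    Y = apply_A(omega)                               # 1. sketching
    for _ in range(q):                               # 2. power iterations
        Y = np.linalg.qr(Y)[0]                       #    re-orthonormalize for stability
        Y = apply_A(apply_At(Y))
    basis = np.linalg.qr(Y)[0]                       # 3. QR (in-core stand-in)
    B = apply_At(basis).T                            # 4. B = basis.T @ A, small matrix
    U_small, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = basis @ U_small                              # 5. reconstruction
    return U[:, :k], sigma[:k], Vt[:k]

# Usage: write an exactly rank-5 matrix to disk, then factor it block by block.
rng = np.random.default_rng(1)
m, n, r = 600, 200, 5
A_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "A.bin")
    A_true.tofile(path)
    U, sigma, Vt = ooc_rsvd(path, (m, n), k=r, rows_per_block=128)
rel_err = np.linalg.norm(A_true - (U * sigma) @ Vt) / np.linalg.norm(A_true)
print(f"relative Frobenius error: {rel_err:.2e}")
```

Each call to `apply_A` or `apply_At` is one full pass over the stored matrix, so this sketch reads the data $2q + 2$ times; that pass count is a property of this simplified version only.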

Out-of-Core Blocked QRCP / UTV

For column-pivoted QR, a "left-looking" blocked algorithm (HQRRP_left) minimizes writes by only updating the current QQ1 columns of QQ2 per block. Reads per iteration are needed for Gaussian pivot-sketch and local QR, producing an overall read volume of QQ3 words and write volume of QQ4 words for QQ5.

Randomized UTV (randUTV_AB) generalizes this to QQ6 with explicit power-iteration blocks and staged block-level SVDs. Both methods partition QQ7 into QQ8 blocks, schedule tasks to overlap compute and I/O, and maintain an in-RAM least-recently-used (LRU) cache of blocks (Heavner et al., 2020).
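An in-RAM LRU cache of blocks can be sketched in a few lines. The class below is a hypothetical illustration, not the data structure used by Heavner et al. (2020): the name `BlockCache`, the `load`/`store` callbacks, and the dirty-flag write-back policy are all assumptions made for the example.

```python
from collections import OrderedDict

class BlockCache:
    """Keeps at most `capacity` blocks in RAM, evicting the least recently used."""

    def __init__(self, capacity, load, store):
        self.capacity = capacity
        self.load = load            # load(key) -> block, reads from slow storage
        self.store = store          # store(key, block), writes to slow storage
        self._blocks = OrderedDict()  # key -> [block, dirty flag], oldest first
        self.reads = 0
        self.writes = 0

    def get(self, key):
        if key in self._blocks:
            self._blocks.move_to_end(key)       # mark as most recently used
        else:
            self._evict_if_full()
            self._blocks[key] = [self.load(key), False]
            self.reads += 1
        return self._blocks[key][0]

    def mark_dirty(self, key):
        self._blocks[key][1] = True             # block must be written back on eviction

    def _evict_if_full(self):
        while len(self._blocks) >= self.capacity:
            old_key, (block, dirty) = self._blocks.popitem(last=False)
            if dirty:                           # clean blocks are dropped without a write
                self.store(old_key, block)
                self.writes += 1

    def flush(self):
        for key, entry in self._blocks.items():
            if entry[1]:
                self.store(key, entry[0])
                self.writes += 1
                entry[1] = False

# Usage: a dict stands in for the on-disk block store.
disk = {(i, j): [[0.0]] for i in range(3) for j in range(3)}
cache = BlockCache(2, load=disk.__getitem__, store=disk.__setitem__)
cache.get((0, 0))
cache.get((0, 1))
cache.get((0, 0))      # hit: no read from slow storage
cache.get((1, 1))      # miss: evicts (0, 1), the least recently used block
print(cache.reads, cache.writes)
```

Writing back only dirty blocks mirrors the motivation of the left-looking variant, where most blocks are read but never modified and so never need to be written.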

A summary of algorithmic features is presented below.

Algorithm Blocked Communication-Optimal Out-of-Core Checkpointing
Blocked SVD/RSVD Yes Yes Yes
HQRRP_left QRCP Yes Yes Not specified
randUTV_AB UTV Yes Yes Yes

4. GPU-Accelerated and High-Performance Out-of-Core Methods

Modern advances leverage GPUs for out-of-core SVD and related tasks, with the host memory or disk supplying streamed partitioned blocks to device memory.

The block randomized SVD (BRSVD) algorithm makes at most two passes over the data, one for sketching and one for subspace projection, regardless of the power-iteration exponent $q$. GPU-specific optimizations include:

  • In-GPU random matrix generation overlapped with host-device transfers.
  • GEMM (matrix-matrix multiply) ordering to minimize memory requirements.
  • Communication-avoiding QR (CAQR) on GPU for orthonormalization.
  • Block size adaptation to maximize occupancy given constrained device memory.
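Block size adaptation can be illustrated with a simple memory budget. The sketch below is a hypothetical model, not the rule used by Lu et al. (2017): it assumes the device must hold the $n \times l$ random matrix, `buffers` input row blocks (two, for double buffering during transfers), and one output block, and it reserves a fraction of device memory for workspace. All names and defaults are assumptions.

```python
def rows_per_block(device_bytes, n, l, word_bytes=8, buffers=2, reserve=0.1):
    """Largest row-block height that fits the assumed device-memory model.

    Assumed footprint in words: n*l            (random matrix)
                              + buffers*rows*n (input blocks in flight)
                              + rows*l         (output block of the sketch)
    """
    budget_words = int(device_bytes * (1 - reserve)) // word_bytes
    free_words = budget_words - n * l
    rows = free_words // (buffers * n + l)
    if rows < 1:
        raise MemoryError("device memory too small for even one row per block")
    return rows

# Usage: 16 GiB device, 100,000 columns, sketch width 110, double precision.
rows = rows_per_block(16 * 2 ** 30, n=100_000, l=110)
print(f"rows per block: {rows}")
```

Larger blocks amortize transfer latency and keep the GEMM kernels efficient, so in this model the natural choice is the largest height that still fits.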

Performance measurements indicate near-peak in-core GPU performance (up to XX04.8 Tflop/s, XX190% of in-core), 13–35XX2 CPU speedup, and 3–5XX3 above naïve multi-GPU approaches, with accuracy matching in-core MKL RSVD (Frobenius-norm error XX4 double/ XX5 single precision) (Lu et al., 2017).

Proposed extensions include multi-GPU or heterogeneous CPU+GPU orchestration, out-of-core tensor decomposition, and asynchronous pipelining for storage-level out-of-core workloads.

5. Practical Implementation Strategies and Memory Planning

Performance and scalability in out-of-core linear algebra depend strongly on block size, prefetching, and overlap of compute with data movement. Key strategies include:

  • Block Size Selection: Chosen to keep the aggregate block memory below the constraint XX6 but large enough to amortize I/O overhead and optimize CPU/GPU utilization. For SSDs and XX7, block sizes of XX8 are typical (Heavner et al., 2020), while dense RSVD recommends XX9 such that the largest of several possible block-wise product sizes fits within HmaxH_{max}0 (Demchik et al., 2019).
  • Threading and I/O: Designate a dedicated I/O thread per storage device for asynchronous data movement, with all remaining cores assigned to numerical kernels. RAM management integrates an LRU block cache and reserves space for filesystem buffer cache (Heavner et al., 2020).
  • Checkpointing and Persistence: All intermediate blocks are checkpointed to disk with metadata and checksums, enabling robust "resume" operation in the event of failure—significantly improving practical usability in extreme matrix scales (Demchik et al., 2019).
  • Overlap and Scheduling: The runtime builds block-wise dependency digraphs, dynamically schedules ready tasks, and avoids re-reading non-updated data blocks (e.g., left-looking QR).
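The dedicated I/O thread and the overlap of reads with computation can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the helper `prefetching_reader`, the `load_block` callback, and the queue depth are hypothetical, and a real runtime would schedule from a dependency graph rather than a fixed key order.

```python
import queue
import threading

def prefetching_reader(load_block, keys, depth=2):
    """Yield (key, block) pairs while a background thread reads ahead.

    `depth` bounds how many blocks may wait in RAM, which caps memory use.
    """
    pending = queue.Queue(maxsize=depth)
    done = object()                      # sentinel marking the end of the stream

    def io_worker():
        try:
            for key in keys:
                pending.put((key, load_block(key)))   # blocks when the queue is full
            pending.put(done)
        except Exception as exc:         # hand I/O errors to the consumer
            pending.put(exc)

    threading.Thread(target=io_worker, daemon=True).start()
    while True:
        item = pending.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item

# Usage: accumulate a squared Frobenius norm block by block.
# A dict stands in for slow storage; each "block" is a short list of values.
storage = {i: [float(i)] * 4 for i in range(5)}
total = 0.0
for key, block in prefetching_reader(storage.__getitem__, range(5)):
    total += sum(x * x for x in block)   # compute overlaps with the next read
print(total)
```

The bounded queue is the design choice that matters here: it lets the reader run ahead of the compute loop without letting prefetched blocks exceed the memory budget.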

With careful implementation, the real-world slowdown relative to in-core methods remains moderate (e.g., out-of-core SVD or QRCP at roughly 2–5$\times$ the walltime on modern SSDs vs. RAM), with compute often dominating unless memory is extremely constrained (Heavner et al., 2020, Demchik et al., 2019).

6. Performance Characteristics and Application Domains

Empirical benchmarks confirm the scalability of out-of-core algorithms:

  • Out-of-core RSVD and blocked factorizations scale to matrices with HmaxH_{max}3 rows/columns, both dense and sparse, and handle arbitrarily small memory limits, with sub-linear increases in runtime due to overlap and optimized scheduling (Demchik et al., 2019, Heavner et al., 2020).
  • GPU-accelerated out-of-core SVD retains high throughput even for data far exceeding GPU memory, with minimal loss in accuracy (Lu et al., 2017).
  • In symmetric kernel factorizations, communication lower bounds are achieved up to leading order by TBS and LBC (Beaumont et al., 2022).

Out-of-core linear algebra methods are now fundamental in domains ranging from robust PCA in computer vision to large-scale data science pipelines and scientific computing, wherever data volumes preclude classic in-memory techniques.

7. Theoretical and Algorithmic Implications

The asymptotic lower bounds on communication volume ($\Omega(N^3/\sqrt{S})$ for dense factorizations of $N \times N$ matrices with fast memory of size $S$), and their realization via carefully blocked and scheduled algorithms, define the practical frontier of out-of-core linear algebra (Beaumont et al., 2022, Heavner et al., 2020). The $\sqrt{2}$-fold operational intensity improvement for symmetric kernels (SYRK, Cholesky) highlights the algorithmic leverage conferred by exploiting mathematical structure.

Blocked randomized algorithms (both for low-rank approximation and full rank-revealing factorizations) demonstrate that randomized sketches, power iterations, and block-level factorizations can be orchestrated effectively in external memory settings.

As device and storage-level parallelism advances, these approaches remain essential for scaling core linear algebraic operations to the exabyte regime and beyond. A plausible implication is that as SSD and NVMe bandwidths increase and overlap techniques continue to mature, out-of-core algorithms will achieve near parity with in-memory performance for a growing set of problem classes (Heavner et al., 2020).


