Out-of-Core Linear Algebra

Updated 7 May 2026

Out-of-core linear algebra is a set of algorithms that manage data movement between limited fast memory and extensive slow storage, enabling large-scale matrix computations.
These methods use communication-optimal schedules and blocking strategies to minimize I/O overhead, bringing data transfer volumes close to their theoretical lower bounds.
They are applied in scientific computing, data science, and GPU-accelerated environments to process matrices that exceed in-core memory capacities.

Out-of-core linear algebra encompasses algorithms and computational frameworks that enable the efficient solution of linear algebra problems when the working data greatly exceeds the available fast memory (RAM or device memory). This paradigm requires explicit optimization of data movement (communication) between a limited fast memory and (potentially unbounded) slow memory, such as hard drives or SSDs. Key advances in this field include optimal communication lower bounds for classical matrix factorizations, randomized block algorithms for singular value and rank-revealing decompositions, and general runtime strategies that minimize wall-clock times even as matrices scale far beyond in-core capacity.

1. Two-Level Memory Model and Communication Cost

Central to out-of-core linear algebra is the two-level memory model: computation is performed in fast memory (size $S$ words), while the dataset resides on an unbounded slow memory medium. Data not in fast memory must be explicitly transferred ("read" or "written"), and the total communication volume $Q$, the sum of all words moved, becomes a primary performance constraint (Beaumont et al., 2022, Heavner et al., 2020). The arithmetic intensity $\rho$ (the ratio of arithmetic operations to words moved) is determined by the algorithm's schedule, and the minimum attainable $Q$ limits performance on large-scale problems.

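A generic lower bound on communication for fixed-work algorithms is derived as follows. Suppose a subcomputation that accesses at most $X$ distinct data items can perform at most $H_{max}$ arithmetic operations. Since at most $S$ of those items can already reside in fast memory, the subcomputation must transfer at least $X - S$ words, so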

$\rho \leq \frac{H_{max}}{X-S} \qquad \implies \qquad Q \geq \frac{V}{\rho} \geq \frac{V(X-S)}{H_{max}}$

where $V$ is the total number of arithmetic operations (Beaumont et al., 2022).
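As a worked instance of this bound, the sketch below applies it to classical matrix multiplication. The choice of kernel and the estimate $H_{max}(X) \leq (X/3)^{3/2}$ (a consequence of the Loomis-Whitney inequality) are assumptions brought in for illustration and do not come from the text above; the function names are hypothetical. Because the inequality $\rho \leq H_{max}/(X-S)$ holds for every $X$, the code scans $X$ for the value that makes it tightest.

```python
import math

def gemm_hmax(x):
    """Assumed bound: most multiply-adds reachable from x distinct matrix entries."""
    return (x / 3.0) ** 1.5

def tightest_intensity(s, hmax=gemm_hmax):
    """Return (X, rho) minimizing hmax(X) / (X - S) over integers X in (S, 10*S]."""
    best_x, best_rho = None, math.inf
    for x in range(s + 1, 10 * s + 1):
        rho = hmax(x) / (x - s)
        if rho < best_rho:
            best_x, best_rho = x, rho
    return best_x, best_rho

def comm_lower_bound(v, s, hmax=gemm_hmax):
    """Q >= V / rho, using the tightest intensity bound found by the scan."""
    _, rho = tightest_intensity(s, hmax)
    return v / rho

S = 1024          # fast memory size in words (illustrative)
N = 4096          # matrix dimension (illustrative)
x_star, rho_star = tightest_intensity(S)
q_min = comm_lower_bound(N ** 3, S)   # V = N^3 multiply-adds
print(x_star, rho_star, q_min)
```

Under these assumptions the scan settles at $X = 3S$ with $\rho = \sqrt{S}/2$, which gives $Q \geq 2N^3/\sqrt{S}$ for an $N \times N$ product.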

2. Communication-Optimal Algorithms for Symmetric Kernels

Fundamental symmetric linear algebra operations, including the symmetric rank-$k$ update (SYRK) and Cholesky factorization, exhibit structural redundancy that can be exploited for higher arithmetic intensity, and thus reduced data movement, compared to conventional non-symmetric kernels.

For SYRK (computing $C \leftarrow C + AA^T$ for $A \in \mathbb{R}^{N \times M}$), the optimal communication lower bound is

$Q_{\mathrm{SYRK}} \geq \frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}}$

For Cholesky factorization of $N \times N$ symmetric positive definite matrices,

$Q_{\mathrm{Chol}} \geq \frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}}$

Both bounds improve on those for non-symmetric kernels (e.g., GEMM, LU) by a factor of $\sqrt{2}$, reflecting the doubled utility of each matrix element in symmetric updates (Beaumont et al., 2022).

Matching algorithms that attain the optimal $Q$ to leading order include:

Tiled Block SYRK (TBS): Block decomposition with recursive processing of "triangle" submatrices, enabling communication volume $\frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}}$ plus lower-order terms.
Levelwise Block Cholesky (LBC): Combines OOC Cholesky on diagonal blocks with panel updates via TBS, yielding $\frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}}$ plus lower-order terms.

These approaches establish that, for symmetric kernels, the lower bounds are tight in their leading terms, demonstrating an intrinsic $\sqrt{2}$-fold higher operational intensity relative to non-symmetric counterparts (Beaumont et al., 2022).
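For orientation, the following sketch implements a plain square-tile out-of-core SYRK and counts the words of $A$ it reads. It is a baseline, not the TBS algorithm: it only exploits symmetry by skipping the upper block triangle. The use of `np.memmap` as slow storage, the tile size `b`, the streaming width `kc`, and the function name are all assumptions made for this illustration.

```python
import os
import tempfile
import numpy as np

def ooc_syrk_lower(a, c, b, kc):
    """Update the lower block triangle of C += A @ A.T, one b x b tile at a time.

    a, c : arrays backed by slow storage (np.memmap in the demo below)
    b    : tile edge length; kc : columns of A streamed per step
    Returns the number of words of A read from slow storage.
    """
    n, m = a.shape
    words_read = 0
    for i in range(0, n, b):
        for j in range(0, i + 1, b):
            acc = np.zeros((min(b, n - i), min(b, n - j)))   # tile held in fast memory
            for k in range(0, m, kc):
                ai = np.array(a[i:i + b, k:k + kc])          # explicit read
                words_read += ai.size
                if i == j:
                    aj = ai                                  # diagonal tile reuses the panel
                else:
                    aj = np.array(a[j:j + b, k:k + kc])
                    words_read += aj.size
                acc += ai @ aj.T
            c[i:i + b, j:j + b] += acc                       # one read-modify-write per tile
    return words_read

rng = np.random.default_rng(0)
n, m, b, kc = 64, 32, 16, 8
with tempfile.TemporaryDirectory() as tmp:
    a = np.memmap(os.path.join(tmp, "A.bin"), dtype=np.float64, mode="w+", shape=(n, m))
    c = np.memmap(os.path.join(tmp, "C.bin"), dtype=np.float64, mode="w+", shape=(n, n))
    a[:] = rng.standard_normal((n, m))
    words = ooc_syrk_lower(a, c, b, kc)
    a_ref, c_out = np.array(a), np.array(c)
    del a, c   # release the memory maps before the directory is removed

print(words, n * n * m // b)
```

When `b` divides `n`, this schedule reads exactly $N^2 M / b$ words of $A$. With $b \approx \sqrt{S}$ that is $N^2 M/\sqrt{S}$, which is the kind of volume that symmetry-aware schedules aim to reduce further.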

3. Out-of-Core Singular Value and Rank-Revealing Factorizations

Out-of-core SVD, QRCP, and UTV factorizations address the challenge of computing rank structure and low-rank approximations when input matrices (dense or sparse) exceed memory constraints. The leading approaches use randomized block algorithms that move data in bulk, compute at the block level, and overlap I/O with numerical kernels (Lu et al., 2017, Demchik et al., 2019, Heavner et al., 2020).

Out-of-Core Randomized SVD (RSVD)

Block-wise RSVD algorithms partition the input matrix $A$ (dense or sparse) across one or both indices so all submatrices fit in memory. The main steps are:

Sketching: $Y = A\Omega$ using a Gaussian random matrix $\Omega$.
Blockwise Power Iterations: Repeated multiplications $Y \leftarrow A(A^T Y)$, each pass performed out-of-core.
Out-of-Core Block QR: $Y$ is orthonormalized via block Gram-Schmidt or TSQR, giving an orthonormal basis $Q_Y$.
Subspace Projection: $B = Q_Y^T A$ is formed blockwise and then a small (in-core) SVD is computed: $B = \tilde{U} \Sigma V^T$.
Reconstruction: $U = Q_Y \tilde{U}$ assembled out-of-core.
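The steps above can be sketched as a row-blocked randomized SVD over a memory-mapped matrix. This is a minimal illustration under stated assumptions, not the implementation from the cited papers: the matrix is partitioned by rows only, the tall sketch is assumed to fit in fast memory, and it is orthonormalized with the in-core `np.linalg.qr` rather than block Gram-Schmidt or TSQR. The function names and the parameters `k`, `q`, and `rows` are hypothetical.

```python
import os
import tempfile
import numpy as np

def row_blocks(n, rows):
    """Yield slices covering 0..n in chunks of at most `rows` rows."""
    for start in range(0, n, rows):
        yield slice(start, min(start + rows, n))

def ooc_rsvd(a, k, q=1, rows=256, seed=0):
    """Rank-k randomized SVD of `a`, touching `a` only one row block at a time."""
    n, m = a.shape
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((m, k))
    y = np.empty((n, k))
    for blk in row_blocks(n, rows):                  # sketching pass: Y = A @ Omega
        y[blk] = np.asarray(a[blk]) @ omega
    for _ in range(q):                               # power iterations, two passes each
        y, _ = np.linalg.qr(y)                       # re-orthonormalize for stability
        z = np.zeros((m, k))
        for blk in row_blocks(n, rows):              # Z = A.T @ Y, accumulated blockwise
            z += np.asarray(a[blk]).T @ y[blk]
        z, _ = np.linalg.qr(z)
        for blk in row_blocks(n, rows):              # Y = A @ Z
            y[blk] = np.asarray(a[blk]) @ z
    basis, _ = np.linalg.qr(y)                       # orthonormal basis of the sketch
    proj = np.zeros((k, m))
    for blk in row_blocks(n, rows):                  # projection pass: B = basis.T @ A
        proj += basis[blk].T @ np.asarray(a[blk])
    u_small, sigma, vt = np.linalg.svd(proj, full_matrices=False)   # small in-core SVD
    return basis @ u_small, sigma, vt                # reconstruction: U = basis @ U_small

rng = np.random.default_rng(1)
n, m, r = 300, 80, 5
with tempfile.TemporaryDirectory() as tmp:
    a = np.memmap(os.path.join(tmp, "A.bin"), dtype=np.float64, mode="w+", shape=(n, m))
    a[:] = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # exact rank-5 test matrix
    u, sigma, vt = ooc_rsvd(a, k=10, q=1, rows=64)
    a_ref = np.array(a)
    del a   # release the memory map before the directory is removed

rel_err = np.linalg.norm(a_ref - (u * sigma) @ vt) / np.linalg.norm(a_ref)
print(rel_err)
```

With `q` power iterations this variant makes `2 + 2*q` passes over the matrix. The test matrix has exact rank 5, so a sketch of width 10 recovers it to rounding error.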

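I/O is performed asynchronously; checkpointing supports mid-computation resumption. The total I/O volume is dominated by the repeated passes over $A$ (a multiple of the size of $A$ set by the number of passes), plus minor contributions from the much smaller intermediate matrices (Demchik et al., 2019).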
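One way to picture checkpointed resumption is the sketch below, which stores each computed block together with a checksum in a manifest and recomputes only blocks that are missing or fail verification. The file layout, the `BlockCheckpoint` class, the `manifest.json` name, and the choice of SHA-256 are all assumptions for illustration; the cited work may organize its checkpoints differently.

```python
import hashlib
import json
import os
import tempfile
import numpy as np

def _sha256(path):
    """Checksum of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

class BlockCheckpoint:
    """Hypothetical on-disk store: one .npy file per block plus a JSON manifest."""

    def __init__(self, root):
        self.root = root
        self.manifest_path = os.path.join(root, "manifest.json")
        os.makedirs(root, exist_ok=True)
        self.manifest = {}
        if os.path.exists(self.manifest_path):
            with open(self.manifest_path) as f:
                self.manifest = json.load(f)

    def save(self, name, block):
        path = os.path.join(self.root, name + ".npy")
        np.save(path, block)
        self.manifest[name] = {"shape": list(block.shape), "sha256": _sha256(path)}
        tmp = self.manifest_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.manifest, f)
        os.replace(tmp, self.manifest_path)          # atomic manifest update

    def load(self, name):
        """Return the block if it exists and its checksum matches, else None."""
        entry = self.manifest.get(name)
        path = os.path.join(self.root, name + ".npy")
        if entry is None or not os.path.exists(path) or _sha256(path) != entry["sha256"]:
            return None
        return np.load(path)

def sketch_with_resume(a_blocks, omega, ckpt):
    """Compute Y = A @ Omega blockwise, skipping blocks already checkpointed."""
    out, computed = [], 0
    for i, blk in enumerate(a_blocks):
        name = f"Y_{i:05d}"
        y = ckpt.load(name)
        if y is None:
            y = blk @ omega
            ckpt.save(name, y)
            computed += 1
        out.append(y)
    return np.vstack(out), computed

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((50, 30)) for _ in range(4)]
omega = rng.standard_normal((30, 8))
with tempfile.TemporaryDirectory() as root:
    y1, first_run = sketch_with_resume(blocks, omega, BlockCheckpoint(root))
    y2, second_run = sketch_with_resume(blocks, omega, BlockCheckpoint(root))
    with open(os.path.join(root, "Y_00002.npy"), "wb") as f:
        f.write(b"corrupted")                        # simulate a damaged checkpoint
    y3, third_run = sketch_with_resume(blocks, omega, BlockCheckpoint(root))
print(first_run, second_run, third_run)   # 4 0 1
```

The first run computes all four blocks, the second finds them intact and computes none, and the third recomputes only the block whose checksum no longer matches.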

Out-of-Core Blocked QRCP / UTV

For column-pivoted QR, a "left-looking" blocked algorithm (HQRRP_left) minimizes writes by updating only the current block of $b$ columns of $A$ at each step. Each iteration requires reads for the Gaussian pivot sketch and the local QR; the resulting read and write volumes are quantified in Heavner et al. (2020).

Randomized UTV (randUTV_AB) generalizes this to the factorization $A = UTV^T$ with explicit power-iteration blocks and staged block-level SVDs. Both methods partition $A$ into $b \times b$ blocks, schedule tasks to overlap compute and I/O, and maintain an in-RAM least-recently-used (LRU) cache of blocks (Heavner et al., 2020).
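The block cache mentioned above can be illustrated with a small write-back LRU cache over a blocked matrix. This is a generic sketch, not the cache used in the cited implementation: the class name, the `capacity` parameter (counted in blocks), and the read and write counters are hypothetical, and a plain NumPy array stands in for the on-disk matrix.

```python
from collections import OrderedDict
import numpy as np

class LRUBlockCache:
    """Write-back LRU cache of b x b blocks of a matrix held in slow storage."""

    def __init__(self, store, block, capacity):
        self.store = store            # array-like slow storage (np.memmap would also work)
        self.b = block
        self.capacity = capacity      # maximum number of blocks kept in fast memory
        self.cache = OrderedDict()    # (i, j) -> [array, dirty flag], oldest first
        self.reads = 0
        self.writes = 0

    def _slices(self, key):
        i, j = key
        return (slice(i * self.b, (i + 1) * self.b), slice(j * self.b, (j + 1) * self.b))

    def get(self, i, j, for_write=False):
        """Return block (i, j); mark it dirty if the caller intends to modify it."""
        key = (i, j)
        if key in self.cache:
            self.cache.move_to_end(key)               # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self._evict()
            self.cache[key] = [np.array(self.store[self._slices(key)]), False]
            self.reads += 1
        entry = self.cache[key]
        entry[1] = entry[1] or for_write
        return entry[0]

    def _evict(self):
        key, (blk, dirty) = self.cache.popitem(last=False)   # least recently used
        if dirty:
            self.store[self._slices(key)] = blk              # write back modified block
            self.writes += 1

    def flush(self):
        while self.cache:
            self._evict()

store = np.arange(64.0).reshape(8, 8)
original = store.copy()
cache = LRUBlockCache(store, block=4, capacity=2)
cache.get(0, 0)
cache.get(0, 1)
cache.get(0, 0)                       # hit: no read, (0, 1) becomes the LRU block
cache.get(1, 0)                       # miss: evicts (0, 1)
blk = cache.get(1, 1, for_write=True)
blk += 1.0                            # modify the cached block in place
cache.flush()
print(cache.reads, cache.writes)      # 4 1
```

Only the modified block is written back, so clean blocks cost a read but never a write.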

A summary of algorithmic features is presented below.

Algorithm	Blocked	Communication-Optimal	Out-of-Core Checkpointing
Blocked SVD/RSVD	Yes	Yes	Yes
HQRRP_left QRCP	Yes	Yes	Not specified
randUTV_AB UTV	Yes	Yes	Yes

4. GPU-Accelerated and High-Performance Out-of-Core Methods

Modern advances leverage GPUs for out-of-core SVD and related tasks, with the host memory or disk supplying streamed partitioned blocks to device memory.

The block randomized SVD (BRSVD) algorithm operates in two main passes, one for sketching and one for subspace projection, so it needs at most two passes over the data regardless of the power-iteration exponent $q$. GPU-specific optimizations include:

In-GPU random matrix generation overlapped with host-device transfers.
GEMM (matrix-matrix multiply) ordering to minimize memory requirements.
Communication-avoiding QR (CAQR) on GPU for orthonormalization.
Block size adaptation to maximize occupancy given constrained device memory.
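To make the GEMM-ordering point concrete, the sketch below compares the two ways of evaluating the chain $A^T A \Omega$ that arises in a power-iteration step. The text does not specify which products BRSVD reorders, so this chain, the cost model, and the function name are assumptions chosen for illustration; NumPy on the CPU stands in for device GEMMs.

```python
import numpy as np

def chain_costs(n, m, k):
    """Flops and largest intermediate (in words) for A.T @ A @ Omega.

    A is n x m and Omega is m x k, with k much smaller than m in a typical sketch.
    """
    gram_first = {                       # (A.T @ A) @ Omega
        "flops": 2 * n * m * m + 2 * m * m * k,
        "intermediate_words": m * m,     # the m x m Gram matrix
    }
    sketch_first = {                     # A.T @ (A @ Omega)
        "flops": 2 * n * m * k + 2 * m * n * k,
        "intermediate_words": n * k,     # the n x k sketch
    }
    return gram_first, sketch_first

gram_first, sketch_first = chain_costs(n=2000, m=1000, k=20)
print(gram_first, sketch_first)

# Both orderings give the same result up to rounding, checked here on a small case.
rng = np.random.default_rng(0)
a = rng.standard_normal((60, 40))
omega = rng.standard_normal((40, 5))
same = np.allclose((a.T @ a) @ omega, a.T @ (a @ omega))
print(same)
```

For a narrow sketch, multiplying by $\Omega$ first avoids ever forming the $m \times m$ Gram matrix, which matters when device memory is the binding constraint.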

Performance measurements indicate near-peak in-core GPU performance (up to about 4.8 Tflop/s, about 90% of in-core), a 13–35$\times$ speedup over CPU, and a 3–5$\times$ improvement over naïve multi-GPU approaches, with Frobenius-norm error matching in-core MKL RSVD in both double and single precision (Lu et al., 2017).

Proposed extensions include multi-GPU or heterogeneous CPU+GPU orchestration, out-of-core tensor decomposition, and asynchronous pipelining for storage-level out-of-core workloads.

5. Practical Implementation Strategies and Memory Planning

Performance and scalability in out-of-core linear algebra depend strongly on block size, prefetching, and overlap of compute with data movement. Key strategies include:

Block Size Selection: Chosen to keep the aggregate block memory below the fast-memory limit but large enough to amortize I/O overhead and keep CPU/GPU utilization high. Heavner et al. (2020) report typical block sizes for SSD-backed runs, while for dense RSVD Demchik et al. (2019) recommend choosing the block size so that the largest of several possible block-wise product sizes fits within the memory limit.
Threading and I/O: Designate a dedicated I/O thread per storage device for asynchronous data movement, with all remaining cores assigned to numerical kernels. RAM management integrates an LRU block cache and reserves space for filesystem buffer cache (Heavner et al., 2020).
Checkpointing and Persistence: All intermediate blocks are checkpointed to disk with metadata and checksums, enabling robust "resume" operation in the event of failure—significantly improving practical usability in extreme matrix scales (Demchik et al., 2019).
Overlap and Scheduling: The runtime builds block-wise dependency digraphs, dynamically schedules ready tasks, and avoids re-reading non-updated data blocks (e.g., left-looking QR).
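The dedicated I/O thread described in the list above can be sketched with a bounded queue: a reader thread loads blocks ahead of the consumer while the main thread computes. This is a minimal illustration under stated assumptions, with a hypothetical function name and `depth` parameter. It does not propagate reader errors and leaves the reader blocked if the consumer stops early, both of which a production runtime would need to handle.

```python
import queue
import threading
import numpy as np

def prefetching_blocks(store, rows, depth=2):
    """Yield row blocks of `store`, read ahead by a dedicated I/O thread.

    `depth` bounds how many blocks may wait in fast memory (2 = double buffering).
    """
    buffers = queue.Queue(maxsize=depth)
    done = object()                                   # end-of-stream marker

    def reader():
        for start in range(0, store.shape[0], rows):
            buffers.put(np.array(store[start:start + rows]))   # blocks when queue is full
        buffers.put(done)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = buffers.get()
        if item is done:
            return
        yield item

# Demo: accumulate the Gram matrix A.T @ A while the reader fetches the next block.
rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 32))                   # stands in for an on-disk matrix
gram = np.zeros((32, 32))
n_blocks = 0
for blk in prefetching_blocks(a, rows=128):
    gram += blk.T @ blk                               # compute overlaps the next read
    n_blocks += 1
print(n_blocks, np.allclose(gram, a.T @ a))
```

Bounding the queue keeps the prefetched data within a fixed memory budget, so a fast reader cannot overrun fast memory.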

With careful implementation, the real-world slowdown relative to in-core methods remains moderate (e.g., out-of-core SVD or QRCP at roughly 2–5$\times$ the in-RAM wall time on modern SSDs), with compute often dominating unless memory is extremely constrained (Heavner et al., 2020, Demchik et al., 2019).

6. Performance Characteristics and Application Domains

Empirical benchmarks confirm the scalability of out-of-core algorithms:

Out-of-core RSVD and blocked factorizations scale to dense and sparse matrices whose dimensions far exceed in-core capacity and handle arbitrarily small memory limits, with sub-linear increases in runtime due to overlap and optimized scheduling (Demchik et al., 2019, Heavner et al., 2020).
GPU-accelerated out-of-core SVD retains high throughput even for data far exceeding GPU memory, with minimal loss in accuracy (Lu et al., 2017).
In symmetric kernel factorizations, communication lower bounds are achieved up to leading order by TBS and LBC (Beaumont et al., 2022).

Out-of-core linear algebra methods are now fundamental in domains ranging from robust PCA in computer vision to large-scale data science pipelines and scientific computing, wherever data volumes preclude classic in-memory techniques.

7. Theoretical and Algorithmic Implications

The asymptotic lower bounds on communication volume ($\Omega(N^3/\sqrt{S})$ for dense factorizations of $N \times N$ matrices), and their realization via carefully blocked and scheduled algorithms, define the practical frontier of out-of-core linear algebra (Beaumont et al., 2022, Heavner et al., 2020). The $\sqrt{2}$-fold operational intensity improvement for symmetric kernels (SYRK, Cholesky) highlights the algorithmic leverage conferred by exploiting mathematical structure.

Blocked randomized algorithms (both for low-rank approximation and full rank-revealing factorizations) demonstrate that randomized sketches, power iterations, and block-level factorizations can be orchestrated effectively in external memory settings.

As device and storage-level parallelism advances, these approaches remain essential for scaling core linear algebraic operations to the exabyte regime and beyond. A plausible implication is that as SSD and NVMe bandwidths increase and overlap techniques continue to mature, out-of-core algorithms will achieve near parity with in-memory performance for a growing set of problem classes (Heavner et al., 2020).

References