Out-of-Core Computation & Sharding
- Out-of-core computation and sharding are techniques used to partition data and workloads into manageable segments, enabling processing of datasets or models that exceed single-device memory limits.
- Advanced methods like cache-blocking, asynchronous data transfers, and on-the-fly compression optimize performance by reducing data movement and mitigating memory transfer delays.
- In distributed systems and deep learning, modular sharding paradigms scale resource efficiency with analytical models and parallelization, addressing both performance and security challenges.
Out-of-core computation and sharding encompass a set of strategies and algorithms that enable the processing of data and models whose scale exceeds the capacity of any single memory or computational unit. These methods have become essential in scientific computing, high-performance deep learning, and distributed systems where data, model parameters, or working sets far surpass the available on-device memory. Key approaches include partitioning data or computation into manageable tiles or shards, minimizing data movement between memory hierarchies or distributed nodes, overlapping data transfers with computation, and employing algorithmic and system-level optimizations to reduce efficiency losses.
1. Tile Decomposition and Cache-Blocking for Out-of-Core Scientific Computation
A central technique for out-of-core scientific workloads, such as stencil computations (ubiquitous in CFD and PDE solvers), is cache-blocking tiling. The goal is to decompose a large grid into smaller tiles such that each tile's full footprint, including its halo (the overlapping edge data needed by the stencil operator), fits into the "fast memory" available (e.g., GPU on-device memory or KNL MCDRAM). The footprint constraint can be written as
$$\text{footprint} \;=\; b \sum_{\text{datasets}} \prod_{d} \big(T_d + 2h_d\big) \;\le\; M_{\text{fast}},$$
where $T_d$ is the tile extent in dimension $d$, $h_d$ the halo depth required by the stencil, $b$ the bytes per grid point, and $M_{\text{fast}}$ the fast-memory capacity. Inside frameworks like OPS, this decomposition is augmented by runtime data-dependency analysis that enables a "skewed tiling schedule": once a tile's data is loaded, a maximal sequence of dependent parallel loops is executed before the data is evicted, boosting temporal locality and minimizing redundant data movement.
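As a concrete illustration (this is not OPS code; the halo depth, number of datasets, and bytes per grid point are assumed inputs), the sketch below picks the largest square 2D tile whose halo-padded footprint fits a given fast-memory budget:

```python
# Illustrative tile-size selection for cache-blocking tiling.
# Assumption: a square 2D tile, a uniform halo depth, and double-precision data.

def tile_footprint_bytes(tile_x, tile_y, halo, n_datasets, bytes_per_point=8):
    """Footprint of one tile: (tile extent + 2*halo) per dimension,
    summed over all datasets touched by the loop chain."""
    padded_x = tile_x + 2 * halo
    padded_y = tile_y + 2 * halo
    return padded_x * padded_y * bytes_per_point * n_datasets

def largest_square_tile(fast_mem_bytes, halo, n_datasets, bytes_per_point=8):
    """Binary search for the largest square tile that still fits in fast memory."""
    lo, hi = 1, 1 << 20
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if tile_footprint_bytes(mid, mid, halo, n_datasets, bytes_per_point) <= fast_mem_bytes:
            lo = mid
        else:
            hi = mid - 1
    return lo

if __name__ == "__main__":
    # e.g. 16 GB of device memory, 2-point halo, 5 grid datasets, double precision
    side = largest_square_tile(16 * 2**30, halo=2, n_datasets=5)
    print(f"largest square tile: {side} x {side} points")
```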
Explicit data management leverages triple buffering to overlap tile computation with prefetch (of the "right footprint" for the next tile) and write-back (of the "left footprint" results from the previous tile) across three CUDA or device streams. This concurrency model ensures that as the accelerator is computing one tile, data for the next and previous tiles are being staged in and out, effectively hiding memory transfer latency.
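The buffer rotation can be sketched schematically as follows; `upload`, `compute`, and `download` are hypothetical stubs standing in for asynchronous copies and kernel launches on three device streams, and a thread pool stands in for their concurrency:

```python
# Conceptual sketch of triple buffering (not the OPS implementation).
# Three device buffers rotate roles each iteration: one is being prefetched,
# one is being computed on, and one is being written back.
from concurrent.futures import ThreadPoolExecutor

def upload(tile_id, buf):     # stand-in for a host-to-device copy of the next tile
    print(f"prefetch tile {tile_id} into buffer {buf}")

def compute(tile_id, buf):    # stand-in for kernel launches on the current tile
    print(f"compute tile {tile_id} in buffer {buf}")

def download(tile_id, buf):   # stand-in for a device-to-host copy of the previous tile
    print(f"write back tile {tile_id} from buffer {buf}")

def run(num_tiles):
    with ThreadPoolExecutor(max_workers=3) as pool:   # one worker per "stream"
        for t in range(num_tiles):
            jobs = [pool.submit(compute, t, t % 3)]
            if t + 1 < num_tiles:                     # stage in the next tile
                jobs.append(pool.submit(upload, t + 1, (t + 1) % 3))
            if t - 1 >= 0:                            # drain the previous tile
                jobs.append(pool.submit(download, t - 1, (t - 1) % 3))
            for j in jobs:                            # synchronize between iterations
                j.result()

if __name__ == "__main__":
    run(5)
```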
Additional optimizations for data transfers include omitting unnecessary host-device synchronization for read-only datasets (which do not require download) and write-first datasets (which do not require upload), and the use of speculative prefetching at chain boundaries to reduce loop chain stall.
This approach has enabled problems up to 48 GB—more than three times the on-chip memory of the device—to be solved with an efficiency penalty as low as 15%. For instance, CloverLeaf 2D on Intel KNL and NVIDIA P100 maintained performance within 16% of the in-core baseline, demonstrating that careful management of data movement and parallel execution is critical for high-efficiency out-of-core stencil computations (Reguly et al., 2017).
2. Unified Abstraction for Accelerator Out-of-Core Execution
Effective out-of-core computation on hybrid architectures depends on software frameworks that insulate algorithmic logic from hardware-specific details. The libhclooc library provides a uniform C++ interface for out‑of‑core matrix-matrix multiplication and related kernels, abstracting away the differences in data transfer and asynchronous execution between CUDA (for GPUs), Intel offload (for Xeon Phi), and OpenCL (for FPGAs):
- The library exposes generic memory and stream management APIs (hclMalloc, hclMemcpyAsync, hclDeviceSynchronize) while internally using device-specific implementations.
- Out-of-core routines partition large matrices into host- and device-resident sub-blocks, orchestrating data transfers and compute launches to overlap as much as possible via two or more asynchronous streams.
- A canonical five-stage pipeline organizes the asynchronous transfer of A and B slices, the loading of output C blocks, the compute stage (DGEMM), and the asynchronous write-back of updated C blocks to the host; a simplified blocked-GEMM sketch follows this list.
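A minimal NumPy sketch of the blocked structure (this is not libhclooc's C++ API; in the real pipeline each stage is issued asynchronously on a separate stream so that transfers overlap with the DGEMM):

```python
# Blocked, out-of-core-style GEMM sketch. The comments map each step to the
# pipeline stages described above; block size is an illustrative parameter.
import numpy as np

def blocked_gemm(A, B, C, block):
    n, k = A.shape
    _, m = B.shape
    for i in range(0, n, block):
        for j in range(0, m, block):
            c_blk = C[i:i+block, j:j+block].copy()        # "load C block"
            for p in range(0, k, block):
                a_slice = A[i:i+block, p:p+block]          # "transfer A slice"
                b_slice = B[p:p+block, j:j+block]          # "transfer B slice"
                c_blk += a_slice @ b_slice                 # "compute (DGEMM)"
            C[i:i+block, j:j+block] = c_blk                # "write back C block"
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((512, 512)), rng.standard_normal((512, 512))
    C = np.zeros((512, 512))
    assert np.allclose(blocked_gemm(A, B, C, 128), A @ B)
```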
Performance overheads relative to vendor-tuned solutions remain modest (4–10%), while portability and code productivity are greatly improved (with observed 75% reductions in LOC for MMOOC). This demonstrates that API-level virtualization is feasible for high-performance out-of-core primitives across accelerator classes (Hanlon et al., 2018).
3. Advanced Data Transfer Overlap and Compression Techniques
CPU–GPU data transfer remains a principal bottleneck in out-of-core GPU workloads. Recent work introduces combined pipelined approaches and compression to address this:
- On-the-fly compression enables datasets (including stencil halos) to be reduced during host–device transfer. Techniques such as "separate compression", which compresses the non-halo ("remainder") and the overlapping ("halo") regions of each block independently, permit correct edge-dependency handling and data recomposability (Shen et al., 2021, Shen et al., 2022); a minimal sketch follows this list.
- Compression rates are typically 2:1 (64-bit values reduced to 32 bits via cuZFP), halving PCIe transfer volume; employing CUDA streams to pipeline the transfer, decompression, computation, and recompression steps provides further overlap, pushing the bottleneck from transfer to compute.
- Single working buffer strategies, reliant on a DAG-based scheduling of CUDA events across stages, can reduce GPU memory consumption by 33%, since three half-sized compressed buffers and one full working buffer can handle three-stage pipelined computation without resource conflict (Shen et al., 2022).
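A minimal sketch of separate compression, with zlib standing in for a GPU floating-point compressor such as cuZFP; the block shape and halo width are illustrative:

```python
# Separate compression: the halo ring and the interior ("remainder") of a block
# are compressed independently, so a neighbour's halo can be exchanged and
# decompressed without touching the rest of the block.
import zlib
import numpy as np

def separate_compress(block, halo):
    """Compress the interior and the halo ring of a 2D block independently."""
    interior = block[halo:-halo, halo:-halo]
    halo_strips = [block[:halo, :], block[-halo:, :],      # top and bottom rows
                   block[:, :halo], block[:, -halo:]]      # left and right columns
    c_interior = zlib.compress(interior.tobytes())
    c_halo = zlib.compress(b"".join(s.tobytes() for s in halo_strips))
    return c_interior, c_halo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.standard_normal((130, 130))
    c_interior, c_halo = separate_compress(block, halo=1)
    # Two independent payloads: only the halo part needs to be shipped to neighbours.
    print(len(c_interior), len(c_halo))
```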
Precision degradation due to lossy compression remains acceptably low (relative errors in the 10⁻⁶–10⁻⁷ range for up to 4,320 stencil time steps) while overall speedups of 1.1–1.2× for out-of-core codes are documented. These results show that compression is a pragmatic complement to pipelined data movement in out-of-core architectures.
4. Sharding Paradigms in Distributed and Blockchain Systems
Sharding—the decomposition of data, compute, or consensus into loosely coupled subsets—plays a dual role: it both enables out-of-core operation by distributing large-state workloads and addresses the scalability trilemma in distributed ledgers.
- Coded Sharding in Blockchains (PolyShard): Polynomially coded sharding (via Lagrange interpolation and Reed–Solomon decoding) assigns each node a coded mixture of all shards. Each node stores and computes over these coded aggregates, which, when decoded, allow recovery of the verification output for all shards. This achieves linear scaling of security, throughput, and storage efficiency with the number of nodes, since the number of adversarial nodes that can be tolerated and the throughput and storage gains all grow in proportion to network size (Li et al., 2018); a toy encoding/decoding sketch follows this list.
- Modular Sharding Architectures: Decomposition into node selection, epoch randomness, node assignment, intra-shard consensus (PBFT/HotStuff or proof-of-X), cross-shard transaction processing (2PC, relay, split models), reconfiguration, and incentive mechanisms isolates concerns and allows rigorous analysis of scalability and attack surfaces (Liu et al., 2021).
- Resource-Aware Analytical Models: Queueing-theoretic analyses model each shard as a queueing network (with Poisson arrivals and batch departures for block production), yielding closed-form expressions for throughput in fully sharded and computation-sharded scenarios. The fully sharded model shows quasi-linear scaling; when only computation is sharded but relaying/storage is centralized, bottlenecks in the network queue cap system throughput (Soltani et al., 2022).
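The coding idea behind PolyShard can be illustrated with a toy sketch over the rationals (the actual protocol works over a finite field and adds Reed–Solomon error correction); the shard values, evaluation points, and verification polynomial `f` below are purely illustrative:

```python
# Toy Lagrange-coded sharding: each node stores one coded symbol u(alpha_i);
# applying a polynomial verification function f to the coded symbols and
# interpolating recovers f on every original shard.
from fractions import Fraction

def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x, exactly."""
    total = Fraction(0)
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        term = Fraction(yj)
        for m, xm in enumerate(xs):
            if m != j:
                term *= Fraction(x - xm, xj - xm)
        total += term
    return total

K = 3                                   # number of shards
shards = [Fraction(v) for v in (5, 7, 11)]
betas = list(range(K))                  # interpolation points indexing the shards
f = lambda y: y * y + 1                 # polynomial verification function, degree 2
N = 2 * (K - 1) + 1                     # nodes needed to decode a degree-2 image
alphas = list(range(10, 10 + N))        # storage-node evaluation points

# Encoding: node i stores u(alpha_i), a coded mixture of all shards.
coded = [lagrange_eval(betas, shards, a) for a in alphas]

# Each node applies f to its own coded symbol; decoding interpolates f(u(x))
# and reads off f at the shard points, recovering f(shard_k) for every shard.
results = [f(c) for c in coded]
decoded = [lagrange_eval(alphas, results, b) for b in betas]
assert decoded == [f(s) for s in shards]
print(decoded)
```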
Additional sharding innovations include "sharding by account" with unlimited subchains (each account is its own shard, with main-chain rollup confirmation), matrix decomposition and quantization for offloading inference blocks (as in model-agnostic distributed inference), and application of ε-cost sharding for near-optimal load balance and memory locality in massive static function/filter constructions (Kan et al., 2022, Angione et al., 29 Jul 2024, Vigna, 24 Mar 2025).
5. Out-of-Core Sharding in Large-Scale Deep Learning and Model Parallelism
Out-of-core deep learning workloads, especially for large-scale recommendation models or transformers, use sharding to map model components or embedding tables to device memory partitions:
- Dynamic Sharding and SPMD Partitioning (GShard): Sharding annotations (split, replicate) in the computation graph are consumed by an SPMD partitioner, generating a single program with communication primitives (AllReduce, AllToAll) for partitioned layers, such as Mixture-of-Experts (MoE) experts with billions of parameters. Automatic rematerialization and halo exchange patterns allow O(1) per-device memory usage, enabling, e.g., training sparse MoE Transformers with 600 B parameters on 2,048 TPUs in 4 days (Lepikhin et al., 2020).
- Neural Cost Model–Driven Sharding (NeuroShard): Pre-trained neural cost models rapidly predict the (forward/backward, compute/communication) costs of candidate sharding schemes for DLRM embedding tables, enabling fast search for memory- and computation-balanced sharding plans even at multi-terabyte scales. This supports up to 11.6% cost reduction and a 6.6% end-to-end throughput gain on a 128-GPU production DLRM (Zha et al., 2023); a simplified greedy cost-balancing sketch follows this list.
- MoE Expert Tensor Sharding (MoEShard): Row- and column-wise decomposition of expert matrices across all GPUs achieves perfect load balancing for MoE inference, eliminating token drops or excess replication and improving time-to-first-token by up to 6.4× over DeepSpeed-style expert parallelism (Balmau et al., 11 Mar 2025).
- Adjoint Sharding for Very Long Contexts: For state space model (SSM) LLMs, adjoint sharding breaks the backward pass into shards of independent vector–Jacobian products (VJPs), enabling memory-efficient training over sequences up to or above 100K tokens by distributing VJP computation. The truncated variant restricts backward steps to the most recent tokens, reducing the quadratic compute cost with negligible loss in gradient quality, and distributed implementations deliver up to 3× memory saving and substantial speedup for context windows that would otherwise exceed hardware resources (Xu et al., 1 Jan 2025).
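A simplified sketch of cost-balanced placement in this spirit; the per-table `(memory, predicted cost)` estimates and the greedy heuristic below are illustrative stand-ins for NeuroShard's learned cost models and search procedure:

```python
# Greedy cost- and memory-balanced sharding of embedding tables across devices.
# Tables are placed largest-predicted-cost first onto the currently cheapest
# device that still has room, subject to a per-device memory cap.
import heapq

def shard_tables(tables, num_devices, mem_cap_gb):
    """tables: list of (name, memory_gb, predicted_cost). Returns {name: device}."""
    heap = [(0.0, 0.0, d) for d in range(num_devices)]   # (cost, memory, device id)
    heapq.heapify(heap)
    placement = {}
    for name, mem, cost in sorted(tables, key=lambda t: -t[2]):
        rejected = []
        while heap:
            dev_cost, dev_mem, d = heapq.heappop(heap)
            if dev_mem + mem <= mem_cap_gb:              # fits: assign and update load
                placement[name] = d
                heapq.heappush(heap, (dev_cost + cost, dev_mem + mem, d))
                break
            rejected.append((dev_cost, dev_mem, d))      # too full: try next device
        else:
            raise RuntimeError(f"table {name} does not fit on any device")
        for item in rejected:                            # restore skipped devices
            heapq.heappush(heap, item)
    return placement

if __name__ == "__main__":
    tables = [("user_id", 8.0, 3.1), ("item_id", 12.0, 4.0),
              ("category", 0.5, 0.4), ("geo", 1.0, 0.9)]
    print(shard_tables(tables, num_devices=2, mem_cap_gb=16.0))
```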
6. Sharding and Out-of-Core Execution in NoSQL and Distributed Datastores
Sharding in distributed datastores, such as MongoDB deployed on HPC clusters, enables out-of-core operation by subdividing storage and query workloads:
- Cluster-Queued Execution: A run script assigns HPC cluster nodes to configuration servers (managing metadata), shard servers (owning distinct data chunks), and router nodes (aggregating client queries), while Lustre or similar parallel file systems allow data persistence when nodes are reclaimed by other jobs.
- Scaling Properties: Bulk ingest (insertMany) and sharded find queries scale approximately linearly as the number of shard/router pairs increases, with throughput growing with both $s$, the number of shards, and $p$, the number of concurrent client processes (Saxton et al., 2022); a minimal client-side sketch follows this list.
- Out-of-Core Implications: Intermediate results and working sets reside in persistent distributed storage, while each node only loads its current shard’s subset, supporting data science workloads that are orders-of-magnitude larger than node memory.
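A minimal client-side sketch using pymongo; the router hostname, database, and collection names are placeholders, and the standard `enableSharding`/`shardCollection` admin commands are issued through the mongos router (the run-script orchestration of configuration, shard, and router nodes described above is outside this sketch):

```python
# Hash-shard a collection via a mongos router, then bulk-ingest and query it.
# Hostname, database, and collection names are illustrative placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://router-node:27017")      # mongos query router
client.admin.command("enableSharding", "experiments")
client.admin.command("shardCollection", "experiments.samples",
                     key={"sample_id": "hashed"})         # hash key spreads chunks

coll = client.experiments.samples
docs = [{"sample_id": i, "value": i * 0.5} for i in range(100_000)]
coll.insert_many(docs)                                    # bulk ingest fans out to shards
print(coll.count_documents({"value": {"$gt": 100}}))      # sharded query via the router
```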
7. Memory Locality and Balance in Massive-Scale Static Data Structures
Sharding is also essential for scaling static functions and filters (e.g., minimal perfect hash or filter constructions) to trillions of keys:
- ε-Cost Sharding: By partitioning $n$ items into $s$ shards whose maximum size is, with high probability, at most $(1+\varepsilon)$ times the mean size $n/s$, ε-cost sharding dispenses with per-shard metadata (offsets, local seeds), facilitating offset-free lookup and predictable memory access patterns. The key guarantee follows from a standard balls-in-bins (Chernoff) analysis:
$$\Pr\Big[\max_{1 \le i \le s} |S_i| > (1+\varepsilon)\,\tfrac{n}{s}\Big] \;\le\; s \exp\!\Big(-\frac{\varepsilon^{2} n}{3 s}\Big),$$
provided the expected shard size $n/s$ is on the order of $\varepsilon^{-2}\log s$ or larger; a quick empirical check appears in the sketch below.
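The balance guarantee is easy to check empirically; in the sketch below, uniform random assignment stands in for the hash function and the parameters are illustrative:

```python
# Empirical check of shard balance: hash n keys into s shards and compare the
# largest shard to the mean n/s over a few trials.
import random

def max_overflow(n, s, trials=3):
    """Worst observed ratio of the largest shard to the mean shard size."""
    worst = 0.0
    mean = n / s
    for t in range(trials):
        rng = random.Random(t)
        counts = [0] * s
        for _ in range(n):
            counts[rng.randrange(s)] += 1
        worst = max(worst, max(counts) / mean)
    return worst

if __name__ == "__main__":
    # With an expected shard size of 10_000, the largest shard typically stays
    # within a few percent of the mean, consistent with the bound above.
    print(max_overflow(n=1_000_000, s=100))
```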
Construction and queries for filters built this way are highly cache-friendly, enabling trillion-key offline builds at 60 ns/key with 10.5% space overhead compared to information-theoretic lower bounds, and with only a nanosecond-level query penalty over non-sharded versions. Trade-offs with alternative methods (like BuRR, which achieve sub-1% overhead but higher latency and slower parallelization) are discussed (Vigna, 24 Mar 2025).
In sum, out-of-core computation and sharding leverage multi-level partitioning (tiles, blocks, shards) with careful management of data dependencies, memory usage, and computation-communication overlap. These principles are realized via specialized tile/block scheduling in HPC, unified APIs and pipelines for heterogeneous accelerators, coded and modular sharding in ledgers, analytical/queueing-theoretic models for throughput quantification, cost-model-driven sharding in deep learning, and memory-local static structures for massive data. The state of the art demonstrates that, with appropriately engineered sharding strategies, modern systems can operate efficiently on datasets and models far exceeding the constraints of local memory or compute, with only modest penalties in throughput and latency.