Overlapping Compute, Communication, and Memory Access

Updated 2 March 2026

Overlapping compute, communication, and memory access is a technique that concurrently executes tasks to hide latency and enhance system efficiency.
It employs fine-grained scheduling, explicit pipelining, and DMA offload engines to improve resource utilization and achieve significant speedups in communication-bound scenarios.
The approach integrates hardware-software co-design and formal performance models to optimize critical paths in parallel computation and distributed training workflows.

Overlapping compute, communication, and memory access is a foundational performance optimization in parallel and distributed computing. The technique hides the latency of communication and memory operations by executing them concurrently with computation, thereby reducing wall-clock time and increasing resource utilization. Realizing these overlaps requires architectural, algorithmic, and sometimes model-level innovations, ranging from explicit scheduling, pipelining, and finer-grained task decomposition to communication offload engines and architectural decoupling of dependencies. The following sections survey formal models, techniques, and empirically validated systems for overlapping compute, communication, and memory access.

1. Formal Performance Models for Overlap

Quantitative models of overlapping compute, communication, and memory access establish theoretical performance bounds and guide both algorithmic and systems-level design.

For parallel sparse matrix–vector multiplication (spMVM), a canonical performance model integrates computation, memory, and communication costs per process (assuming compressed row storage, CRS):

Compute time: $T_{\text{compute}} = 2\,Nnz_\text{local} / P_\text{peak}$
Memory time: $T_{\text{mem}} = B_\text{CRS}\cdot 2\,Nnz_\text{local} / BW_\text{mem}$, where the code balance $B_\text{CRS}$ (bytes/flop) encodes locality and re-use [see Eqns. (2), (3) in (Schubert et al., 2011)]
Communication time: $T_{\text{comm}} = \alpha n_\text{msg} + \beta(S_\text{send} + S_\text{recv})$

Without overlap, the times simply add: $T_{\text{total}} = T_{\text{compute}} + T_{\text{mem}} + T_{\text{comm}}$. With ideal overlap of communication and local computation, the kernel splits into a first phase that touches only local data and a second phase that consumes the received remote data, and the critical path reduces to $T_{\text{total,ol}} = \max(T_\text{phase1}, T_{\text{comm}}) + T_\text{phase2}$. Communication is then fully hidden under the local phase whenever $T_\text{phase1} \geq T_{\text{comm}}$ (Schubert et al., 2011).
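
As a rough illustration of the model above, the following Python sketch evaluates the communication term and both totals. The function names and every numeric value are assumptions chosen only to exercise the formulas; they are not measurements from Schubert et al.

```python
def spmvm_comm_time(alpha, beta, n_msg, bytes_sent, bytes_recv):
    """Latency-bandwidth model: T_comm = alpha * n_msg + beta * (S_send + S_recv)."""
    return alpha * n_msg + beta * (bytes_sent + bytes_recv)


def total_no_overlap(t_compute, t_mem, t_comm):
    """Serialized execution: the three costs simply add."""
    return t_compute + t_mem + t_comm


def total_ideal_overlap(t_phase1, t_phase2, t_comm):
    """Communication runs concurrently with phase 1; phase 2 follows both."""
    return max(t_phase1, t_comm) + t_phase2


# Hypothetical inputs in seconds and bytes (illustration only).
t_comm = spmvm_comm_time(alpha=2e-6, beta=1e-9, n_msg=10,
                         bytes_sent=4e6, bytes_recv=4e6)

# Long local phase: communication is fully hidden.
t_hidden = total_ideal_overlap(t_phase1=0.012, t_phase2=0.003, t_comm=t_comm)

# Short local phase: communication stays on the critical path.
t_exposed = total_ideal_overlap(t_phase1=0.004, t_phase2=0.003, t_comm=t_comm)
```

In the first case the total equals the two phase times alone; in the second the communication time replaces the shorter local phase on the critical path.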

In the context of distributed deep learning, similar bounds appear: for compute time $T_\text{comp}$ and communication time $T_\text{comm}$ per step, ideal two-way overlap yields a speedup of $(T_\text{comp} + T_\text{comm}) / \max(T_\text{comp}, T_\text{comm})$ over serialized execution, which is at most 2×. Realized speedups are attenuated by resource contention (slowdown factors) and memory system effects (Agrawal et al., 2024, Pal et al., 11 Dec 2025).

For chunk-level fine-grained scheduling, the overlap ratio, defined as the fraction of communication time that pipelining hides behind computation, quantifies how much of the transfer cost is removed from the critical path (Qiang et al., 28 Jan 2026).
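
The quantities in this section can be sketched in a few lines of Python. The sketch assumes the common definitions: ideal speedup is serialized time divided by the longer of the two phases, contention is modeled by multiplicative slowdown factors of at least 1, and the overlap ratio is hidden communication time divided by total communication time. The function and parameter names are illustrative and are not taken from the cited papers.

```python
def ideal_speedup(t_comp, t_comm):
    """Serialized time over perfectly overlapped time; never exceeds 2."""
    return (t_comp + t_comm) / max(t_comp, t_comm)


def realized_speedup(t_comp, t_comm, slow_comp=1.0, slow_comm=1.0):
    """Overlap where contention stretches each side by a slowdown factor >= 1."""
    overlapped = max(slow_comp * t_comp, slow_comm * t_comm)
    return (t_comp + t_comm) / overlapped


def overlap_ratio(t_comm, t_exposed):
    """Fraction of communication time hidden behind computation."""
    return (t_comm - t_exposed) / t_comm


balanced = ideal_speedup(1.0, 1.0)                        # best case: equal phases
contended = realized_speedup(1.0, 1.0, slow_comp=1.25)    # compute slowed by 25%
hidden_fraction = overlap_ratio(t_comm=4.0, t_exposed=1.0)
```

The balanced case reaches the upper bound of 2, and a 25% compute slowdown from contention already lowers it to 1.6, which illustrates why realized speedups fall short of the ideal.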

2. Algorithmic and System-Level Techniques

Effective overlap mechanisms span explicit programming patterns, runtime scheduling, chunk/fragment decomposition, DMA offload engines, and hardware–software co-design.

Thread-Based and Task-Based Overlap

In hybrid (MPI + OpenMP) spMVM, “task-mode” overlap dedicates a core or thread per NUMA domain to communication (e.g., posting nonblocking receives and servicing them) while the remaining threads execute compute kernels (Schubert et al., 2011). The approach partitions the work into “local-only” and “remote-halo” tasks, so that communication and the local-only computation can run in parallel.
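
The structure of task-mode overlap can be sketched with plain Python threads. This is only a structural illustration under stated assumptions: the halo exchange is simulated by a sleep instead of MPI calls, the matrix is a toy dictionary of rows, and all names (halo_exchange, spmv_rows, local_rows, halo_rows) are hypothetical.

```python
import threading
import time


def halo_exchange(remote_x, inbox):
    """Stand-in for the communication thread: the sleep models network latency."""
    time.sleep(0.05)
    inbox.update(remote_x)


def spmv_rows(rows, x, y):
    """Accumulate y[i] += a * x[j] for sparse rows given as {i: [(j, a), ...]}."""
    for i, entries in rows.items():
        for j, a in entries:
            y[i] += a * x[j]


local_rows = {0: [(0, 2.0), (1, 1.0)], 1: [(1, 3.0)]}  # touch only locally owned x
halo_rows = {0: [(2, 4.0)], 1: [(3, 5.0)]}             # touch x owned by other ranks
x = {0: 1.0, 1: 2.0}                                    # locally owned vector entries
remote_x = {2: 10.0, 3: 20.0}                           # what a neighbor would send
y = [0.0, 0.0]

inbox = {}
comm = threading.Thread(target=halo_exchange, args=(remote_x, inbox))
comm.start()                  # communication proceeds in the background
spmv_rows(local_rows, x, y)   # local-only part overlaps with the exchange
comm.join()                   # halo data must be complete before the remote part
x.update(inbox)
spmv_rows(halo_rows, x, y)    # remote-halo part finishes the product
```

The join call marks the only synchronization point: everything before it can overlap with the exchange, and everything after it depends on received data.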

Fine-Grained Chunk and Tile Scheduling

AutoOverlap (Qiang et al., 28 Jan 2026) abstracts tensor-partitioned work into “chunks” with explicit communication dependencies:

As soon as a chunk is available via P2P or collective op, its dependent tiles (from the computation kernel) can be executed immediately.
Compiler transformations convert conventional kernel main loops into chunk-level pipelines.

This achieves up to 4.7× speedup on communication-bound layers by maximizing pipelined execution and by using memory prefetch or double buffering to overlap data movement with computation (Qiang et al., 28 Jan 2026).
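
The chunk-level pipelining idea can be sketched as a producer-consumer loop. This is not AutoOverlap's generated code: the producer thread stands in for a P2P or collective transfer, compute_tile stands in for the dependent tiles, and the two-slot queue is an assumed approximation of double buffering.

```python
import queue
import threading


def producer(chunks, q):
    """Stand-in for a transfer that delivers chunks one at a time."""
    for idx, chunk in enumerate(chunks):
        q.put((idx, chunk))
    q.put(None)  # sentinel: no more chunks


def compute_tile(chunk):
    """Stand-in for the tiles that depend on one chunk (here: a sum of squares)."""
    return sum(v * v for v in chunk)


def chunk_pipeline(chunks):
    q = queue.Queue(maxsize=2)  # two slots approximate double buffering
    t = threading.Thread(target=producer, args=(chunks, q))
    t.start()
    results = [None] * len(chunks)
    while True:
        item = q.get()
        if item is None:
            break
        idx, chunk = item
        # Compute on this chunk while later chunks are still in flight.
        results[idx] = compute_tile(chunk)
    t.join()
    return results


partial = chunk_pipeline([[1, 2], [3, 4], [5, 6]])
```

The consumer never waits for the full tensor; it blocks only when the next chunk has not yet arrived.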

Finer-Granularity DMA-Orchestrated Schedules

FiCCO (Pal et al., 11 Dec 2025) advocates decomposing communication “one level deeper” than standard shard-level overlap: each already-sharded transfer is split into further sub-transfers, allowing an all-to-all transfer pattern and computation on any arriving fragment without waiting for full peer shards. Offloading these sub-tiles to GPU DMA engines eliminates compute-core contention and roughly halves communication-induced slowdowns. Heuristics based on op-to-byte ratio, traffic, and kernel intensity drive runtime schedule selection.

DMA Engine and Communication Offload

For modern deep learning accelerators and GPUs, communication collectives executed on main compute cores (SMs/CUs) compete for cache, compute, and HBM bandwidth with forward/backward-pass kernels, leading to substantial overlap inefficiency (Agrawal et al., 2024, Rashidi et al., 2020).

Dedicated offload engines (e.g., ACE in (Rashidi et al., 2020), ConCCL in (Agrawal et al., 2024)) move collectives entirely out of the compute pipeline.
DMA engines in the GPU I/O die or endpoint NICs handle bulk transfer and reduction, while compute units remain 100% available for user kernels.
Empirical results: ConCCL achieves up to 1.67× realized speedup vs. non-overlapped baseline, 72% of the theoretical maximum. ACE reduces required DRAM bandwidth by 3.5× and boosts network utilization by 1.44× (Agrawal et al., 2024, Rashidi et al., 2020).

3. Hardware–Software Co-Design and Architectural Abstractions

Solutions to computation–communication overlap increasingly integrate fine-grained hardware mechanisms and architectural refactoring.

Track-and-Trigger and Near-Memory Compute (T3)

T3 (Pati et al., 2024) introduces per-chunk “trackers” in the GPU memory controller that count stores or DMA writes to buffer regions. When all writes for a chunk arrive, the controller triggers an out-of-band DMA, effecting pipelined comm–compute overlap without full kernel fusion. Near-memory compute (NMC) for collectives (e.g., atomic add on write in DRAM banks) further reduces memory and compute traffic for reduction.

Coordinated by the driver via virtual region maps, the scheduler staggers workgroups and leverages wavefront-level granularity for high throughput and overlap, sustaining 1.41–1.54× speedup and 22–32% reduction in DRAM traffic for large-scale transformers (Pati et al., 2024).
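
The track-and-trigger idea can be mimicked in software with a small counter object. This is a conceptual sketch only: T3 implements tracking in the memory controller, whereas the class below, its name ChunkTracker, and its parameters are hypothetical stand-ins.

```python
class ChunkTracker:
    """Counts writes per chunk and fires a callback once a chunk is complete."""

    def __init__(self, chunk_size, writes_per_chunk, on_complete):
        self.chunk_size = chunk_size
        self.writes_per_chunk = writes_per_chunk
        self.on_complete = on_complete
        self.counts = {}

    def record_write(self, address):
        chunk_id = address // self.chunk_size
        self.counts[chunk_id] = self.counts.get(chunk_id, 0) + 1
        if self.counts[chunk_id] == self.writes_per_chunk:
            # In hardware this is where the DMA for the chunk would start.
            self.on_complete(chunk_id)


triggered = []
tracker = ChunkTracker(chunk_size=4, writes_per_chunk=4,
                       on_complete=triggered.append)
for address in [0, 4, 1, 5, 2, 3, 6, 7]:  # interleaved stores to chunks 0 and 1
    tracker.record_write(address)
```

Each chunk is handed to the transfer path as soon as its last write lands, without waiting for the producing kernel to finish.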

Storage-Class Memory and Overlap Beyond DRAM

The Erudite architecture (Qureshi et al., 2020) demonstrates that SM-style accelerators can overlap thousands of in-flight storage-class memory (NVMe SSD) reads and writes using a direct NVMe queue abstraction, local scheduling, and RDMA-style network messaging. This fine-grained, high-concurrency model lets compute pipelines proceed with minimal stalls, maintaining arithmetic throughput while hiding high-latency I/O and network access. With roughly 2,000 concurrent storage requests in flight, the empirical analysis shows reduced “memory wall” effects and 8.8× performance improvements for pointer-chasing workloads.
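
Why so many requests must be in flight follows from Little's law: the number of outstanding requests equals latency times request rate. The sketch below applies it to a hypothetical device; the latency, bandwidth, and request size are assumptions for illustration and are not figures from the Erudite paper.

```python
import math


def required_in_flight(latency_s, target_bandwidth_bytes_s, request_bytes):
    """Little's law: outstanding requests = latency * request rate."""
    request_rate = target_bandwidth_bytes_s / request_bytes
    return math.ceil(latency_s * request_rate)


# Hypothetical device: 100 microsecond read latency, 4 KiB requests, 8 GB/s target.
n_requests = required_in_flight(100e-6, 8e9, 4096)
```

Longer latency or smaller requests push the required concurrency up proportionally, which is why high-latency storage needs far more outstanding requests than DRAM to sustain the same bandwidth.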

4. Model-Level and Algorithmic Architectural Innovations

Some recent methods change the computational dependency DAG itself to facilitate system-level overlap.

Decoupling Data Dependencies (“Ladder Residual”)

The “Ladder Residual” architecture (Zhang et al., 11 Jan 2025) for transformers with tensor parallelism algebraically decouples the dependency between layer computation and inter-device communication: outputs are all-reduced one layer after they are produced, by feeding each computation a “stale” residual from two layers prior. This permits issuing compute for layer $i$ and the collective communication for layer $i-1$ in parallel, formally reducing the critical path from $T_\text{comp} + T_\text{comm}$ per layer to $\max(T_\text{comp}, T_\text{comm})$.

In 70B-parameter transformer inference on 8 GPUs, this yields 23–30% end-to-end latency reduction and >90% GPU occupancy, with negligible memory overhead and no degradation of final model accuracy (Zhang et al., 11 Jan 2025).
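
The effect on the critical path can be sketched with a small timing model. The dependency rules below are an assumed simplification (uniform per-layer times, one compute stream, one communication link) rather than the paper's own analysis, and the function names are hypothetical.

```python
def standard_critical_path(n_layers, t_comp, t_comm):
    """Each layer waits for the previous layer's all-reduce before computing."""
    return n_layers * (t_comp + t_comm)


def ladder_critical_path(n_layers, t_comp, t_comm):
    """Layer i computes while the all-reduce of layer i-1 is in flight.

    Assumed dependencies: compute i needs compute i-1 (same device) and
    all-reduce i-2 (the stale residual); all-reduce i needs compute i and
    all-reduce i-1 (same link).
    """
    comp_done = [0.0] * (n_layers + 1)
    comm_done = [0.0] * (n_layers + 1)
    for i in range(1, n_layers + 1):
        stale_ready = comm_done[i - 2] if i >= 2 else 0.0
        comp_done[i] = max(comp_done[i - 1], stale_ready) + t_comp
        comm_done[i] = max(comp_done[i], comm_done[i - 1]) + t_comm
    return comm_done[n_layers]


baseline = standard_critical_path(4, t_comp=3.0, t_comm=2.0)
ladder = ladder_critical_path(4, t_comp=3.0, t_comm=2.0)
```

In this model the per-layer cost in steady state is the larger of the two times, with one unhidden compute or communication step at each end of the layer stack.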

Partitioned and Pipelined Communication (“Streaming DiLoCo”)

For distributed training, “Streaming DiLoCo” (Douillard et al., 30 Jan 2025) synchronizes only model fragments (by layer) at staggered intervals, allowing continued computation while any nonoverlapping parameter subset is exchanged. As long as the exchange of a fragment completes within the computation it overlaps with ($T_\text{comm} \leq T_\text{comp}$ per fragment), all communication is hidden. Deterministic fragment schedules and buffer prefetch enable further memory–compute overlap and reduce peak bandwidth to sub-model size (e.g., ¼ of model params per fragment, quantized).
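
A staggered fragment schedule can be sketched as follows. The even-offset rule, the hiding condition expressed in whole training steps, and all names are assumptions made for illustration; the actual schedule used by Streaming DiLoCo may differ.

```python
def fragments_due(step, n_fragments, sync_period):
    """Return the fragment indices whose synchronization starts at this step.

    Assumption: fragment p syncs every sync_period steps with offset
    p * sync_period // n_fragments, so exchanges are spread evenly.
    """
    due = []
    for p in range(n_fragments):
        offset = p * sync_period // n_fragments
        if step % sync_period == offset:
            due.append(p)
    return due


def is_fully_hidden(t_comm_fragment, t_step, overlap_steps):
    """Communication is hidden if it finishes within the overlapped compute steps."""
    return t_comm_fragment <= overlap_steps * t_step


schedule = {s: fragments_due(s, n_fragments=4, sync_period=8) for s in range(8)}
```

With four fragments and a period of eight steps, at most one fragment is in flight at a time, so the peak bandwidth demand corresponds to one fragment and not to the whole model.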

5. Practical Guidelines, Trade-offs, and Outlook

Empirically Supported Best Practices

Employ simple performance models (roofline/code-balance, op-to-byte) to pre-characterize hardware and kernel bounds (Schubert et al., 2011, Pal et al., 11 Dec 2025).
Allocate communication work to cores/threads/SMs/streams not fully occupied by compute; on SMT systems, consider running comm on logical siblings (Schubert et al., 2011).
Prefer hardware or DMA offload for collectives to remove compute/core and cache interference (Agrawal et al., 2024, Rashidi et al., 2020, Pal et al., 11 Dec 2025).
Fine-grained chunk partitioning and pipelined scheduling can recover additional overlap but entail balancing overhead (decomposition inefficiency) and memory contention (Pal et al., 11 Dec 2025, Qiang et al., 28 Jan 2026).
Track-and-trigger schemes at memory-controller granularity yield high overlap with minimal software refactoring (Pati et al., 2024).
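
The first guideline, pre-characterizing kernels with a simple model, can be sketched with the roofline bound. The machine parameters below are hypothetical; arithmetic intensity is measured in flops per byte, the reciprocal of a code balance given in bytes per flop.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound: min(peak compute, bandwidth * flops-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def bound_type(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Classify a kernel by comparing its intensity with the ridge point."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if arithmetic_intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK, BW = 100e12, 2e12
sparse_kernel = bound_type(PEAK, BW, arithmetic_intensity=0.25)
dense_kernel = bound_type(PEAK, BW, arithmetic_intensity=200.0)
sparse_limit = attainable_flops(PEAK, BW, arithmetic_intensity=0.25)
```

A kernel classified as memory-bound already saturates the memory system, so overlap schemes that add memory traffic on the same device deserve extra scrutiny there.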

Limitations and Contextual Factors

The overhead for very small or irregular partitioning can overwhelm expected gains; empirically, chunk sizes and tile splits must be autotuned (Qiang et al., 28 Jan 2026).
In pure compute- or memory-bound cases (e.g., spMVM on matrices with large average nonzeros/row or very high arithmetic intensity), overlap brings little additional benefit (Schubert et al., 2011).
Algorithmic decoupling (as in Ladder Residual) can add transient memory cost and requires retraining or fine-tuning for legacy models (Zhang et al., 11 Jan 2025).
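
The first limitation can be made concrete with a toy cost model in which every chunk pays a fixed overhead. The model, its parameters, and the candidate list are assumptions for illustration; a real autotuner would time actual kernels and transfers instead of evaluating a formula.

```python
def pipelined_time(total_comm, total_comp, n_chunks, per_chunk_overhead):
    """Time for n equal chunks where chunk k+1 transfers while chunk k computes."""
    comm = total_comm / n_chunks + per_chunk_overhead
    comp = total_comp / n_chunks
    return comm + (n_chunks - 1) * max(comm, comp) + comp


def best_chunk_count(total_comm, total_comp, per_chunk_overhead, candidates):
    """Pick the candidate chunk count with the lowest modeled time."""
    return min(
        candidates,
        key=lambda n: pipelined_time(total_comm, total_comp, n, per_chunk_overhead),
    )


candidates = [1, 2, 4, 8, 16, 32]
best = best_chunk_count(8.0, 8.0, 0.5, candidates)
times = {n: pipelined_time(8.0, 8.0, n, 0.5) for n in candidates}
```

The modeled time first drops as finer chunks expose more overlap and then rises again once the accumulated per-chunk overhead dominates, which is the trade-off that autotuning resolves.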

Emerging Directions

Hardware advancements (more DMA engines, programmable DMA datapaths, near-memory arithmetic, per-channel cache partitions) could further close the gap to ideal overlap (Agrawal et al., 2024, Pati et al., 2024).
Compiler frameworks that infer optimal chunk and pipelining schedules, integrating memory, communication, and compute dependency graphs, offer automatic and portable overlap maximization (Qiang et al., 28 Jan 2026).
Topology-sensitive scheduling heuristics (mesh, ring, all-to-all) remain crucial for extracting the full bandwidth in non-clique interconnects (Pal et al., 11 Dec 2025).