
Overlapping Compute, Communication, and Memory Access

Updated 2 March 2026
  • Overlapping compute, communication, and memory access is a technique that concurrently executes tasks to hide latency and enhance system efficiency.
  • It employs fine-grained scheduling, explicit pipelining, and DMA offload engines to improve resource utilization and achieve significant speedups in communication-bound scenarios.
  • The approach integrates hardware-software co-design and formal performance models to optimize critical paths in parallel computation and distributed training workflows.

Overlapping compute, communication, and memory access is a foundational performance optimization in parallel and distributed computing. The technique hides the latency of communication and memory operations by executing them concurrently with computation, thereby reducing effective wall-clock time and increasing resource utilization. Realizing these overlaps requires architectural, algorithmic, and sometimes model-level innovations, ranging from explicit scheduling and pipelining, through finer-grained task decomposition and communication offload engines, to architectural decoupling of dependencies. The following sections survey rigorous approaches, formal models, and empirically validated systems for overlapping compute, communication, and memory access.

1. Formal Performance Models for Overlap

Quantitative models of overlapping compute, communication, and memory access establish theoretical performance bounds and guide both algorithmic and systems-level design.

For parallel sparse matrix–vector multiplication (spMVM), a canonical performance model integrates computation, memory, and communication costs per process (assuming compressed row storage, CRS):

  • Compute time: $T_{\text{compute}} = 2\,Nnz_{\text{local}} / P_{\text{peak}}$
  • Memory time: $T_{\text{mem}} = B_{\text{CRS}} \cdot 2\,Nnz_{\text{local}} / BW_{\text{mem}}$, where $B_{\text{CRS}}$ (bytes/flop) encodes locality and re-use [see Eqns. (2), (3) in (Schubert et al., 2011)]
  • Communication time: $T_{\text{comm}} = \alpha\, n_{\text{msg}} + \beta\,(S_{\text{send}} + S_{\text{recv}})$

Without overlap, times simply add: $T_{\text{total}} = T_{\text{compute}} + T_{\text{mem}} + T_{\text{comm}}$. With ideal overlap of communication and local computation/data, the critical path reduces to $T_{\text{total,ol}} = \max(T_{\text{phase1}}, T_{\text{comm}}) + T_{\text{phase2}}$, where phase 1 is the local part of the kernel that needs no remote data and phase 2 is the remaining part that consumes the received data. $T_{\text{comm}}$ is fully hidden under the local phase if $T_{\text{phase1}} \geq T_{\text{comm}}$ (Schubert et al., 2011).
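The model above can be evaluated directly. The following is a minimal sketch, not code from Schubert et al.; the function names are hypothetical, and it assumes the usual reading of $\alpha$ as a per-message latency (seconds) and $\beta$ as an inverse bandwidth (seconds per byte). How the kernel time is split into phase 1 and phase 2 is left to the caller, since the text does not specify it.

```python
def t_compute(nnz_local, p_peak):
    """Compute time: 2 flops per nonzero divided by peak flop/s."""
    return 2.0 * nnz_local / p_peak


def t_mem(nnz_local, b_crs, bw_mem):
    """Memory time: code balance (bytes/flop) times flops over bandwidth."""
    return b_crs * 2.0 * nnz_local / bw_mem


def t_comm(n_msg, s_send, s_recv, alpha, beta):
    """Communication time; alpha and beta meanings are assumptions."""
    return alpha * n_msg + beta * (s_send + s_recv)


def t_total_no_overlap(tc, tm, tcomm):
    """Without overlap the three contributions simply add."""
    return tc + tm + tcomm


def t_total_overlap(t_phase1, t_phase2, tcomm):
    """Ideal overlap: communication runs concurrently with phase 1 only."""
    return max(t_phase1, tcomm) + t_phase2


if __name__ == "__main__":
    tcomm = t_comm(n_msg=10, s_send=1e5, s_recv=1e5, alpha=1e-6, beta=1e-9)
    # Hypothetical split of the kernel time into local and halo phases.
    print(t_total_overlap(t_phase1=4e-4, t_phase2=2e-4, tcomm=tcomm))
```

When `t_phase1` is at least `tcomm`, the result no longer depends on `tcomm`, which is the "fully hidden" regime of the model.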

In the context of distributed deep learning, similar bounds appear: for per-step compute time $T_{\text{comp}}$ and communication time $T_{\text{comm}}$, ideal two-way overlap yields speedup $S_{\text{ideal}} = (T_{\text{comp}} + T_{\text{comm}}) / \max(T_{\text{comp}}, T_{\text{comm}})$. Realized speedups are attenuated by resource contention (slowdown factors) and memory system effects (Agrawal et al., 2024, Pal et al., 11 Dec 2025).
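As a worked illustration of this bound, the sketch below computes the ideal speedup and a contention-attenuated variant. The attenuated form, which multiplies each overlapped phase by its own slowdown factor, is an assumption made here for illustration; the cited papers may model contention differently.

```python
def ideal_speedup(t_comp, t_comm):
    """Serial time over the critical path under perfect two-way overlap."""
    return (t_comp + t_comm) / max(t_comp, t_comm)


def attenuated_speedup(t_comp, t_comm, slow_comp=1.0, slow_comm=1.0):
    """Assumed model: each phase stretches by its slowdown factor (>= 1)
    when both run concurrently and contend for shared resources."""
    return (t_comp + t_comm) / max(t_comp * slow_comp, t_comm * slow_comm)


if __name__ == "__main__":
    print(ideal_speedup(1.0, 1.0))                 # balanced case, the upper limit
    print(attenuated_speedup(1.0, 1.0, 1.2, 1.2))  # same case with 20% slowdown
```

The ideal speedup can never exceed 2 and reaches it only when compute and communication take equal time; any imbalance or contention lowers it.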

For chunk-level fine-grained scheduling, the overlap ratio $R_{\text{ov}} = (T_{\text{comp}} + T_{\text{comm}} - T_{\text{pipe}}) / T_{\text{comm}}$ formalizes the fraction of communication hidden via pipelining (Qiang et al., 28 Jan 2026).
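To make the overlap ratio concrete, the sketch below pairs it with a simple estimate of the pipelined time. The estimate is an assumption, not a formula from the cited paper: it treats the work as a two-stage pipeline (transfer, then compute) over equally sized chunks with no per-chunk overhead.

```python
def pipelined_time(t_comp, t_comm, n_chunks):
    """Two-stage pipeline over uniform chunks (assumed model): fill with one
    transfer and one compute, then the slower stage paces the rest."""
    a = t_comm / n_chunks  # per-chunk transfer time
    b = t_comp / n_chunks  # per-chunk compute time
    return a + b + (n_chunks - 1) * max(a, b)


def overlap_ratio(t_comp, t_comm, t_pipe):
    """Fraction of communication time hidden by pipelining."""
    return (t_comp + t_comm - t_pipe) / t_comm


if __name__ == "__main__":
    t_pipe = pipelined_time(t_comp=8.0, t_comm=4.0, n_chunks=4)
    print(t_pipe, overlap_ratio(8.0, 4.0, t_pipe))
```

With a single chunk the pipelined time equals the serial time and the ratio is 0; as the chunk count grows in this overhead-free model, the ratio approaches 1 whenever compute dominates.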

2. Algorithmic and System-Level Techniques

Effective overlap mechanisms span explicit programming patterns, runtime scheduling, chunk/fragment decomposition, DMA offload engines, and hardware–software co-design.

Thread-Based and Task-Based Overlap

In hybrid (MPI + OpenMP) spMVM, “task-mode” overlap dedicates a core or thread per NUMA domain to communication (e.g., posting nonblocking receives and servicing them) while the remaining threads execute compute kernels (Schubert et al., 2011). The approach partitions the work into “local-only” and “remote-halo” tasks, so that communication proceeds while the local-only part is being computed.
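The structure of task-mode overlap can be sketched in a few lines. This is a single-process analogy, not the MPI + OpenMP implementation from the paper: a Python thread stands in for the dedicated communication thread, and `fetch_halo` is a hypothetical callback that returns the remote vector entries. Rows are lists of `(column, value)` pairs, and columns at or beyond `len(x_local)` are assumed to live on other processes.

```python
import threading


def task_mode_spmv(rows, x_local, fetch_halo):
    """y = A @ x with the halo exchange overlapped with the local-only part."""
    n_local = len(x_local)
    halo = {}

    def comm_worker():
        # Stands in for posting nonblocking receives and waiting on them.
        halo.update(fetch_halo())

    comm = threading.Thread(target=comm_worker)
    comm.start()

    # Phase 1 (local-only): needs no remote data, runs during communication.
    y = [sum(v * x_local[c] for c, v in row if c < n_local) for row in rows]

    comm.join()  # all halo values have arrived

    # Phase 2 (remote-halo): add contributions from received entries.
    for i, row in enumerate(rows):
        y[i] += sum(v * halo[c] for c, v in row if c >= n_local)
    return y


if __name__ == "__main__":
    rows = [[(0, 2.0), (2, 1.0)], [(1, 3.0), (3, 4.0)]]
    print(task_mode_spmv(rows, [1.0, 2.0], lambda: {2: 10.0, 3: 0.5}))
```

In CPython the global interpreter lock prevents two threads from computing simultaneously, so the overlap here is real only while the communication thread is blocked on I/O; the sketch is meant to show the phase structure rather than to deliver performance.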

Fine-Grained Chunk and Tile Scheduling

AutoOverlap (Qiang et al., 28 Jan 2026) abstracts tensor-partitioned work into “chunks” with explicit communication dependencies:

  • As soon as a chunk is available via P2P or collective op, its dependent tiles (from the computation kernel) can be executed immediately.
  • Compiler transformations convert conventional kernel main loops to chunk-level pipelines:

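# Schematic only: the helpers are placeholders, not a real API.
# Keep the next chunk's transfer in flight while the current chunk computes.
issue_async_transfer(comm_schedule[rank][0])
for chunk_id in range(N_chunks):
    if chunk_id + 1 < N_chunks:
        issue_async_transfer(comm_schedule[rank][chunk_id + 1])
    wait_for_transfer(comm_schedule[rank][chunk_id])
    for tile in swizzle(chunk_id):  # tiles that depend on this chunk
        compute_tile(tile)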

  • This achieves up to 4.7× speedup on communication-bound layers by maximizing pipelined execution and by using memory prefetch or double-buffering so that transfers overlap with computation (Qiang et al., 28 Jan 2026).
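A runnable analogue of the chunk-level pipeline with double-buffering is sketched below. It is not AutoOverlap's generated code: a background thread stands in for the asynchronous transfer engine, `fetch_chunk` is a hypothetical stand-in for a P2P or collective transfer, and the "tiles" are plain integers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(chunk_id, chunk_size=4, delay=0.01):
    """Hypothetical transfer: sleeps to mimic latency, then returns tiles."""
    time.sleep(delay)
    start = chunk_id * chunk_size
    return list(range(start, start + chunk_size))


def compute_tile(tile):
    """Placeholder compute kernel."""
    return tile * tile


def pipelined_sum_of_squares(n_chunks, fetch=fetch_chunk):
    """Process chunks in order while prefetching the next one."""
    if n_chunks == 0:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)  # prefetch the first chunk
        for chunk_id in range(n_chunks):
            chunk = pending.result()     # wait only for the chunk needed now
            if chunk_id + 1 < n_chunks:
                # Second buffer: start the next transfer before computing.
                pending = pool.submit(fetch, chunk_id + 1)
            for tile in chunk:
                total += compute_tile(tile)
    return total


if __name__ == "__main__":
    print(pipelined_sum_of_squares(3))
```

The key ordering is that the next transfer is issued before the current chunk's tiles are computed, so at most two chunks are alive at once and the transfer latency is paid in the background.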

Finer-Granularity DMA-Orchestrated Schedules

FiCCO (Pal et al., 11 Dec 2025) advocates decomposing communication “one level deeper” than standard shard-overlap: each already-sharded transfer is split into $N$ further sub-transfers, allowing an all-to-all transfer pattern and compute-on-any-arriving-fragment without waiting for full peer shards. Offloading these sub-tiles to GPU DMA engines eliminates compute-core contention and roughly halves communication-induced slowdowns. Heuristics based on op-to-byte ratio, traffic, and kernel intensity drive runtime schedule selection.
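The "one level deeper" decomposition can be illustrated with index arithmetic alone. The sketch below is not FiCCO's scheduler; the function names and the round-robin issue order are assumptions chosen to show how splitting each peer's shard lets a fragment from every peer arrive early.

```python
def split_shard(shard_len, n_sub):
    """Split [0, shard_len) into n_sub nearly equal (start, stop) ranges."""
    base, extra = divmod(shard_len, n_sub)
    ranges, start = [], 0
    for i in range(n_sub):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def fine_grained_schedule(n_peers, shard_len, n_sub):
    """Assumed issue order: one fragment from every peer per round, so
    compute can start on whichever fragment lands first."""
    per_peer = split_shard(shard_len, n_sub)
    return [(peer, frag) for frag in per_peer for peer in range(n_peers)]


if __name__ == "__main__":
    for peer, (start, stop) in fine_grained_schedule(2, 10, 4):
        print(f"peer {peer}: elements [{start}, {stop})")
```

Compared with transferring whole shards peer by peer, this order delivers the first usable data from each peer after roughly 1/N of a shard time, at the cost of N times as many transfer descriptors.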

DMA Engine and Communication Offload

For modern deep learning accelerators and GPUs, communication collectives executed on main compute cores (SMs/CUs) compete for cache, compute, and HBM bandwidth with forward/backward-pass kernels, leading to substantial overlap inefficiency (Agrawal et al., 2024, Rashidi et al., 2020).

  • Dedicated offload engines (e.g., ACE in (Rashidi et al., 2020), ConCCL in (Agrawal et al., 2024)) move collectives entirely out of the compute pipeline.
  • DMA engines in the GPU I/O die or endpoint NICs handle bulk transfer and reduction, while compute units remain 100% available for user kernels.
  • Empirical results: ConCCL achieves up to 1.67× realized speedup vs. non-overlapped baseline, 72% of the theoretical maximum. ACE reduces required DRAM bandwidth by 3.5× and boosts network utilization by 1.44× (Agrawal et al., 2024, Rashidi et al., 2020).

3. Hardware–Software Co-Design and Architectural Abstractions

Solutions to computation–communication overlap increasingly integrate fine-grained hardware mechanisms and architectural refactoring.

Track-and-Trigger and Near-Memory Compute (T3)

T3 (Pati et al., 2024) introduces per-chunk “trackers” in the GPU memory controller that count stores or DMA writes to buffer regions. When all writes for a chunk arrive, the controller triggers an out-of-band DMA, effecting pipelined comm–compute overlap without full kernel fusion. Near-memory compute (NMC) for collectives (e.g., atomic add on write in DRAM banks) further reduces memory and compute traffic for reduction.

  • Coordinated by the driver via virtual region maps, the scheduler staggers workgroups and leverages wavefront-level granularity for high throughput and overlap, sustaining 1.41–1.54× speedup and 22–32% reduction in DRAM traffic for large-scale transformers (Pati et al., 2024).
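The track-and-trigger idea can be mimicked in software. The class below is only an analogy for the hardware trackers described above: the name `ChunkTracker`, the per-chunk expected-write counts, and the `on_ready` callback (standing in for the triggered DMA) are all hypothetical.

```python
class ChunkTracker:
    """Counts writes per chunk and fires a callback when a chunk is complete."""

    def __init__(self, expected_writes, on_ready):
        self.expected = dict(expected_writes)     # chunk_id -> writes required
        self.seen = {chunk: 0 for chunk in self.expected}
        self.on_ready = on_ready                  # stands in for the DMA trigger

    def record_write(self, chunk_id):
        """Called once per store or DMA write landing in the chunk's region."""
        self.seen[chunk_id] += 1
        if self.seen[chunk_id] == self.expected[chunk_id]:
            self.on_ready(chunk_id)


if __name__ == "__main__":
    tracker = ChunkTracker({0: 2, 1: 1}, lambda c: print(f"trigger DMA for chunk {c}"))
    for chunk in (0, 1, 0):
        tracker.record_write(chunk)
```

Because the trigger fires per chunk rather than per kernel, communication for early chunks can begin while the producer is still writing later ones, which is the pipelining effect the section attributes to T3.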

Storage-Class Memory and Overlap Beyond DRAM

The Erudite architecture (Qureshi et al., 2020) demonstrates that SM-style accelerators can overlap thousands of in-flight storage-class memory (NVMe SSD) reads and writes using a direct NVMe queue abstraction, local scheduling, and RDMA-style network messaging. This fine-grained, high-concurrency model enables compute pipelines to proceed with minimal stalls, maintaining arithmetic throughput while hiding high-latency I/O and network accesses. In empirical analysis, sustaining ~2,000 concurrent storage requests shrinks “memory wall” effects and achieves 8.8× performance improvements for pointer-chasing workloads.

4. Model-Level and Algorithmic Architectural Innovations

Some recent methods change the computational dependency DAG itself to facilitate system-level overlap.

Decoupling Data Dependencies (“Ladder Residual”)

The “Ladder Residual” architecture (Zhang et al., 11 Jan 2025) for transformers with tensor parallelism algebraically decouples the dependency between layer computation and inter-device communication: outputs are all-reduced one layer after they are produced (by feeding each computation a “stale” residual from two layers prior). This permits issuing compute for layer $i+1$ and collective communication for layer $i$ in parallel, formally reducing the critical path from $T_{\text{comp}} + T_{\text{comm}}$ per layer to $\max(T_{\text{comp}}, T_{\text{comm}})$.

  • In 70B-parameter transformer inference on 8 GPUs, this yields 23–30% end-to-end latency reduction and >90% GPU occupancy, with negligible memory overhead and no degradation of final model accuracy (Zhang et al., 11 Jan 2025).
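The dependency change can be sketched with scalar "layers". This is one reading of the description above, not the paper's implementation: each layer consumes a residual that does not yet include the previous layer's output, whose all-reduce is still in flight. The `allreduce` argument is a hypothetical callable, and a background thread stands in for the asynchronous collective.

```python
from concurrent.futures import ThreadPoolExecutor


def standard_forward(x, layers, allreduce):
    """Baseline: each layer waits for the previous all-reduce to finish."""
    for f in layers:
        x = x + allreduce(f(x))
    return x


def ladder_forward(x, layers, allreduce):
    """Assumed ladder variant: compute layer i+1 while layer i's output
    is still being all-reduced, then fold that output into the residual."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for f in layers:
            out = f(x)                            # overlaps with `pending`
            handle = pool.submit(allreduce, out)  # start this layer's all-reduce
            if pending is not None:
                x = x + pending.result()          # previous output arrives late
            pending = handle
        if pending is not None:
            x = x + pending.result()
    return x


if __name__ == "__main__":
    layers = [lambda v: 2 * v, lambda v: v + 1, lambda v: 3 * v]
    identity = lambda v: v  # single-device stand-in for the collective
    print(standard_forward(1, layers, identity), ladder_forward(1, layers, identity))
```

The two functions return different values for the same layers, which reflects that the ladder variant is a different model rather than a reordering of the same computation.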

Partitioned and Pipelined Communication (“Streaming DiLoCo”)

For distributed training, “Streaming DiLoCo” (Douillard et al., 30 Jan 2025) synchronizes only model fragments (by layer) at staggered intervals, allowing continued computation while any nonoverlapping parameter subset is exchanged. As long as $T_{\text{comm,frag}} \leq \tau\, T_{\text{comp}}$, all communication is hidden. Deterministic fragment schedules and buffer prefetch enable further memory–compute overlap and reduce peak bandwidth to sub-model size (e.g., ¼ of model params per fragment, quantized).
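A staggered fragment schedule and the hiding condition can be sketched as follows. The even spacing of fragment offsets across the synchronization period and the reading of $\tau$ as the number of compute steps a transfer may overlap with are assumptions made for illustration; the function names are hypothetical.

```python
def fragment_offsets(n_fragments, sync_period):
    """Assumed staggering: spread fragment sync points evenly over the period."""
    return [p * sync_period // n_fragments for p in range(n_fragments)]


def fragments_due(step, n_fragments, sync_period):
    """Fragments whose synchronization starts at this training step."""
    offsets = fragment_offsets(n_fragments, sync_period)
    return [p for p, off in enumerate(offsets) if step % sync_period == off]


def comm_is_hidden(t_comm_frag, tau, t_comp):
    """Hiding condition from the text: T_comm,frag <= tau * T_comp."""
    return t_comm_frag <= tau * t_comp


if __name__ == "__main__":
    for step in (0, 25, 30, 50, 75, 100):
        print(step, fragments_due(step, n_fragments=4, sync_period=100))
    print(comm_is_hidden(t_comm_frag=3.0, tau=5, t_comp=1.0))
```

Because only one fragment is in flight at a time under this schedule, the peak bandwidth demand scales with the fragment size rather than the full model size.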

5. Practical Guidelines, Trade-offs, and Outlook

Empirically Supported Best Practices

Limitations and Contextual Factors

  • The overhead for very small or irregular partitioning can overwhelm expected gains; empirically, chunk sizes and tile splits must be autotuned (Qiang et al., 28 Jan 2026).
  • In pure compute- or memory-bound cases (e.g., spMVM on matrices with large average nonzeros/row or very high arithmetic intensity), overlap brings little additional benefit (Schubert et al., 2011).
  • Algorithmic decoupling (as in Ladder Residual) can add transient memory cost and requires retraining or fine-tuning for legacy models (Zhang et al., 11 Jan 2025).
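The first limitation, that overly fine partitioning can cost more than it saves, shows up even in a toy cost model. The sketch below assumes a two-stage pipeline over uniform chunks with a fixed per-chunk transfer overhead (for example a launch or descriptor cost); both the model and the function names are illustrative assumptions, and a real autotuner would time actual kernels instead.

```python
def pipelined_time_with_overhead(t_comp, t_comm, n_chunks, overhead):
    """Assumed model: each chunk's transfer pays a fixed overhead."""
    a = t_comm / n_chunks + overhead  # per-chunk transfer time
    b = t_comp / n_chunks             # per-chunk compute time
    return a + b + (n_chunks - 1) * max(a, b)


def autotune_chunks(t_comp, t_comm, overhead, candidates):
    """Pick the candidate chunk count with the lowest modeled time."""
    return min(
        candidates,
        key=lambda n: pipelined_time_with_overhead(t_comp, t_comm, n, overhead),
    )


if __name__ == "__main__":
    candidates = [1, 2, 4, 8, 16, 64]
    for n in candidates:
        print(n, pipelined_time_with_overhead(8.0, 8.0, n, 0.5))
    print("best:", autotune_chunks(8.0, 8.0, 0.5, candidates))
```

In this model the time first falls as chunks get smaller and then rises once the accumulated per-chunk overhead exceeds the overlap gained, so the best chunk count is an interior value that depends on the overhead.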

Emerging Directions

  • Hardware advancements (more DMA engines, programmable DMA datapaths, near-memory arithmetic, per-channel cache partitions) could further close the gap to ideal overlap (Agrawal et al., 2024, Pati et al., 2024).
  • Compiler frameworks that infer optimal chunk and pipelining schedules, integrating memory, communication, and compute dependency graphs, offer automatic and portable overlap maximization (Qiang et al., 28 Jan 2026).
  • Topology-sensitive scheduling heuristics (mesh, ring, all-to-all) remain crucial for extracting the full bandwidth in non-clique interconnects (Pal et al., 11 Dec 2025).
