Displaced Patch Parallelism

Updated 4 April 2026

Displaced Patch Parallelism is a strategy that decomposes a global domain into semi-independent patches with dynamic, asynchronous scheduling.
It employs techniques like asynchronous iteration, patch–angle decomposition, and pipeline parallelism to optimize communication and load balancing.
This approach enhances scalability and efficiency in multiphysics simulations, finite element methods, and deep learning inference tasks.

Displaced Patch Parallelism is a parallel computing strategy in which a global computational domain is decomposed into multiple, often semi-independent, "patches" (subdomains) with dynamic, asynchronous, or pipeline-based scheduling of patch updates. Each patch may correspond to a spatial/temporal subregion, a portion of a numerical problem, or a fragment of a data structure, and is typically evolved with its own local state, boundary management, and (optionally) timestep. Displaced patch parallelism is characterized by flexible patch ownership, communication patterns that "displace" data or computation between patches, and the use of patch-specific strategies to achieve scalable and high-performance parallelism in diverse domains such as finite element coupling, multiphysics simulation, radiative transfer, and deep learning inference. Its implementations exploit locality, communication hiding, and overlapping of work across hardware resources.

1. Patch Decomposition: Definitions and Structures

In displaced patch parallelism, the global domain $\Omega$ is partitioned into a set of patches $\{ \Omega^s \}_{s=1}^S$, where each patch is a connected set of cells/points (in a mesh, image, or graph), possibly equipped with ghost or halo cells for boundary exchange (Shiokawa et al., 2017, Kerim et al., 2023, Yan et al., 2018, Fang et al., 2024, Fang et al., 2024). Common organizational principles include:

Patch Metadata: Each patch maintains identifiers, coordinate transformations (for moving/rotating patches or heterogeneous grids), grid geometry, ghost-cell description, and potentially its own physics modules. Grid-based simulations assign each patch a set of MPI processes or threads; deep learning inference (PipeFusion) partitions transformer layers and activations across devices for each patch (Shiokawa et al., 2017, Fang et al., 2024).
Patch Roles: Patches may be designated global (coarse, background) or local (higher-resolution, moving frames), as in multiphysics or multiscale fluid simulations (Shiokawa et al., 2017). In deep learning, patches correspond to non-overlapping spatial tiles within a latent representation (Fang et al., 2024, Fang et al., 2024).
Ownership and Migration: Patch–task pairs are logical processing units; in some implementations, tasks (e.g., sweep directions in Sn transport) migrate between threads or processes to mitigate load imbalance and maximize resource utilization (Yan et al., 2018).
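As a minimal sketch of the structures described above, the following splits a 2D array into equally sized patches and gives each a local array padded with a halo of ghost cells. The names `Patch`, `decompose`, and `interior` are hypothetical and do not come from any of the cited frameworks; the sketch assumes a uniform grid that divides evenly and a copy-edge rule at the physical boundary.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Patch:
    pid: int          # patch identifier
    rows: slice       # owned rows in the global array
    cols: slice       # owned columns in the global array
    halo: int         # ghost-cell width on every side
    data: np.ndarray  # local array: owned cells plus halo


def decompose(field, py, px, halo=1):
    """Split a 2D array into py * px patches, each padded with ghost cells."""
    assert halo >= 1, "sketch assumes at least one ghost layer"
    ny, nx = field.shape
    assert ny % py == 0 and nx % px == 0, "sketch assumes even division"
    hy, hx = ny // py, nx // px
    # Physical boundary: copy edge values outward (an assumed boundary rule).
    padded = np.pad(field, halo, mode="edge")
    patches = []
    for j in range(py):
        for i in range(px):
            r0, c0 = j * hy, i * hx
            # In padded coordinates the owned block starts at (r0 + halo, c0 + halo),
            # so the halo-inclusive window starts at (r0, c0).
            local = padded[r0:r0 + hy + 2 * halo, c0:c0 + hx + 2 * halo].copy()
            patches.append(
                Patch(j * px + i, slice(r0, r0 + hy), slice(c0, c0 + hx), halo, local)
            )
    return patches


def interior(patch):
    """View of the cells the patch owns, excluding ghost cells."""
    h = patch.halo
    return patch.data[h:-h, h:-h]


field = np.arange(64, dtype=float).reshape(8, 8)
patches = decompose(field, 2, 2, halo=1)
```

Here each of the four patches owns a 4×4 block and stores a 6×6 local array; the outer ring holds copies of neighbouring data that a real code would refresh by boundary exchange after every update.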

2. Scheduling, Asynchrony, and Communication Strategies

Displaced patch parallelism supports a spectrum of scheduling and communication patterns, all built to maximize concurrency and hide communication latency.

Asynchronous Iteration: In non-intrusive global–local domain coupling, the global problem and all local patches proceed independently, exchanging boundary conditions and interface tractions via one-sided communication primitives (MPI-RMA). Ranks do not wait at global barriers; each advances when "new enough" data arrives (Kerim et al., 2023).
Patch–Angle (Task) Decomposition: In mesh sweeps (JSweep), each (patch, direction) pair operates as a "patch-program." Dynamic, data-driven scheduling enables the system to initiate work as soon as dependencies (e.g., upwind data) are satisfied. Streams of data are sent only to needed downstream patches, and work stealing enables redistribution of patch-programs among threads (Yan et al., 2018).
Pipeline Parallelism: In transformer inference (PipeFusion), patches are partitioned and injected into a pipeline of model stages, with each patch processed independently but possibly out of order. The pipeline is kept full by displacing patches through layers, reusing "stale" features where possible to reduce waiting on inter-stage data (Fang et al., 2024, Fang et al., 2024).
Client–Router–Server Protocols: Multipatch fluid simulation employs a three-stage protocol where ghost-cell fill requests are routed through a mapping layer and satisfied by servers owning the relevant data, ensuring balanced inter-patch communication even as patch geometry changes (Shiokawa et al., 2017).

In all cases, only boundary/interface data are exchanged; internal patch state remains local. Coordination of data movement, computation, and synchronization is critical to efficiency.
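The data-driven scheduling of (patch, direction) pairs can be illustrated with a small single-threaded sketch. It is not JSweep's API: the function name `sweep_order`, the rectangular patch grid, and the restriction to diagonal sweep directions are all assumptions made for illustration. A task becomes ready once both of its upwind neighbours have finished, and finishing a task notifies only its downwind neighbours.

```python
from collections import deque


def sweep_order(nx, ny, directions):
    """Order (patch, direction) tasks so that upwind patches always run first.

    Patches form an nx-by-ny grid; each direction is (dx, dy) with dx, dy in {-1, +1}.
    """
    def inside(x, y):
        return 0 <= x < nx and 0 <= y < ny

    # Count the unmet upwind dependencies of every task.
    pending = {}
    for dx, dy in directions:
        for x in range(nx):
            for y in range(ny):
                upwind = [(x - dx, y), (x, y - dy)]
                pending[((x, y), (dx, dy))] = sum(inside(ux, uy) for ux, uy in upwind)

    # Tasks with no upwind neighbours (inflow corners) can start immediately.
    ready = deque(task for task, count in pending.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        (x, y), (dx, dy) = task
        order.append(task)
        # Outgoing data goes only to the downwind neighbours for this direction.
        for wx, wy in [(x + dx, y), (x, y + dy)]:
            if inside(wx, wy):
                key = ((wx, wy), (dx, dy))
                pending[key] -= 1
                if pending[key] == 0:
                    ready.append(key)
    return order


order = sweep_order(3, 3, [(1, 1), (-1, -1)])
```

With two opposite directions on a 3×3 grid, the ready queue starts with one corner task per direction, so work from different sweep directions interleaves instead of running one full sweep after another. In a threaded runtime the single queue would be replaced by per-thread queues, which is where work stealing would apply.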

3. Boundary Exchange and Data Consistency

A key design challenge is the accurate and efficient transmission of data between patch boundaries:

Ghost and Interface Data: Patches define ghost-cell arrays or interface regions where data from neighbors (either direct grid neighbors or overlapping patches) must be interpolated or transferred. Interpolations occur in the coordinate basis of the recipient patch, possibly after transformation (for curvilinear or moving meshes) (Shiokawa et al., 2017).
Conservation and Consistency: While interpolation is often trilinear or local, it does not always guarantee conservation of integral quantities (e.g., mass or energy). In some regimes, fractional errors can reach $O(10^{-2})$ unless conservative remapping is used (Shiokawa et al., 2017). For interface coupling (global/local), compatibility and force equilibrium constraints are enforced iteratively to maintain solution accuracy (Kerim et al., 2023).
Temporal Buffering: In pipeline-based architectures (PipeFusion), each stage keeps "stale" activations (such as key/value tensors) from previous steps to sidestep the blocking nature of dependencies, leveraging input temporal redundancy to maintain throughput without sacrificing accuracy (Fang et al., 2024, Fang et al., 2024).
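The conservation point can be made concrete with a one-dimensional analogue. The sketch below fills a recipient ghost cell from donor cell averages in two ways: pointwise linear interpolation at the cell centre, and an overlap-weighted (conservative) average. The function names and the toy grids are hypothetical; the cited codes work in three dimensions with coordinate transformations, which this sketch does not attempt.

```python
import numpy as np


def interp_fill(donor_edges, donor_vals, recv_edges):
    """Fill recipient cells by linear interpolation between donor cell centres."""
    donor_centres = 0.5 * (donor_edges[:-1] + donor_edges[1:])
    recv_centres = 0.5 * (recv_edges[:-1] + recv_edges[1:])
    return np.interp(recv_centres, donor_centres, donor_vals)


def conservative_fill(donor_edges, donor_vals, recv_edges):
    """Fill recipient cells with the overlap-weighted mean of donor cell averages."""
    out = np.zeros(len(recv_edges) - 1)
    for k in range(len(out)):
        lo, hi = recv_edges[k], recv_edges[k + 1]
        # Length of the intersection of [lo, hi] with each donor cell.
        overlap = np.minimum(hi, donor_edges[1:]) - np.maximum(lo, donor_edges[:-1])
        overlap = np.clip(overlap, 0.0, None)
        out[k] = np.dot(overlap, donor_vals) / (hi - lo)
    return out


donor_edges = np.array([0.0, 1.0, 2.0, 3.0])
donor_vals = np.array([1.0, 4.0, 2.0])   # cell averages on the donor patch
recv_edges = np.array([0.5, 2.5])        # one ghost cell, offset from the donor grid

pointwise = interp_fill(donor_edges, donor_vals, recv_edges)
conserved = conservative_fill(donor_edges, donor_vals, recv_edges)
```

The donor integral over the ghost cell is 0.5·1 + 1·4 + 0.5·2 = 5.5, so the conservative cell average is 2.75, whereas pointwise interpolation returns the centre value 4.0. The mismatch is deliberately exaggerated by the non-smooth toy data; for smooth fields the discrepancy is much smaller.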

4. Load Balancing, Local Timestepping, and Scalability

Displaced patch schemes emphasize minimizing both computation and communication imbalance:

Local Time-Stepping: Each patch may use its own timestep, determined by the local CFL condition or a solver constraint. In PATCHWORK, this strategy can reduce total processor-hours by factors of 2–3 when slow patches are rate-limiting (Shiokawa et al., 2017).
Dynamic Scheduling and Work Stealing: Allocation of patches or patch–task pairs to threads/processes is dynamically adjusted in frameworks like JSweep; idle threads steal work from busier ones, leading to logical "displacement" of processing responsibility (Yan et al., 2018).
Asynchronous Termination and Fault Tolerance: Systems advance to completion when residual norms across patches fall below tolerance, with no need for global barriers; in asynchronous coupling frameworks, resilience to network delays and node failures is a natural outcome (Kerim et al., 2023).
Processor Assignment Models: In PATCHWORK, zones per processor for each patch are matched in inverse proportion to the timestep, balancing wall times across heterogeneous patches (Shiokawa et al., 2017).
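To see why local time-stepping pays off, the following sketch counts zone updates over one synchronization interval with and without per-patch timesteps. It is a rough cost proxy under stated assumptions, not PATCHWORK's scheduler: the function names, the CFL number of 0.5, the patch sizes, and the use of zone updates as a stand-in for processor-hours are all illustrative choices.

```python
import math


def cfl_dt(dx, max_speed, cfl=0.5):
    """Largest stable timestep allowed by a CFL condition on one patch."""
    return cfl * dx / max_speed


def zone_updates(patches, t_sync, local=True):
    """Total zone updates needed to advance every patch by t_sync.

    patches: list of (n_zones, dx, max_speed) tuples.
    local=True  -> each patch steps with its own CFL timestep.
    local=False -> every patch steps with the smallest timestep of any patch.
    """
    dts = [cfl_dt(dx, speed) for _, dx, speed in patches]
    dt_global = min(dts)
    total = 0
    for (n_zones, _, _), dt in zip(patches, dts):
        step = dt if local else dt_global
        total += n_zones * math.ceil(t_sync / step)
    return total


# A large coarse patch and a small fine patch with 4x finer spacing (toy numbers).
patches = [(800_000, 1.0, 1.0), (200_000, 0.25, 1.0)]
local_cost = zone_updates(patches, t_sync=1.0, local=True)
global_cost = zone_updates(patches, t_sync=1.0, local=False)
```

With these toy numbers the coarse patch takes 2 steps and the fine patch 8 steps per interval, giving 3.2 million zone updates with local timesteps against 8 million when the fine patch's timestep is imposed everywhere. The ratio depends entirely on the chosen patch sizes and resolutions.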

Strong and weak scaling have been demonstrated for mesh-based solvers up to $O(10^5)$ cores, with parallel efficiencies in the range 20–60% depending on mesh structure and problem size (Yan et al., 2018).

5. Application Domains and Performance

Displaced patch paradigms have been successfully applied across a wide variety of computational science and machine learning contexts:

Multiphysics/Multiscale Fluid and MHD: PATCHWORK supports distinct subregions with different physics, grids, or moving frames, enabling simulation of complex and multiscale phenomena (Shiokawa et al., 2017).
Domain Decomposition and Global/Local Coupling: Structural mechanics, elasticity, and heat transfer benefit from non-intrusive coupling of coarse and fine domains, with asynchronous, patch-based parallelism achieving better load balancing and robustness under heterogeneity (Kerim et al., 2023).
Transport Sweep Solvers: Parallel $S_n$ sweep algorithms for radiative or particle transport rely on patch-centered, data-driven scheduling to overcome the serial bottleneck imposed by inherent directional dependencies (Yan et al., 2018).
Diffusion Model Inference (Patch-Level Pipeline Parallelism): PipeFusion displaces image patches through transformer layers, combining spatial patching with model pipeline parallelism for significant speedups and memory reduction in high-resolution diffusion models (Fang et al., 2024, Fang et al., 2024). Empirical results on 4–8 GPU clusters show speedups of up to 3–4× over single-GPU baselines and a 2× reduction in per-GPU memory.
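The way patches are displaced through a pipeline of model stages can be sketched as an idealized timetable. This is not PipeFusion's or xDiT's code: the function names are hypothetical, every stage is assumed to take one equal-length tick per patch, and communication time is ignored.

```python
def pipeline_schedule(num_patches, num_stages):
    """Timetable of which patch each stage processes at each tick.

    Entry [t][s] is the patch index handled by stage s at tick t, or None if idle.
    """
    ticks = num_patches + num_stages - 1
    table = []
    for t in range(ticks):
        row = []
        for s in range(num_stages):
            p = t - s  # patch p reaches stage s exactly s ticks after entering
            row.append(p if 0 <= p < num_patches else None)
        table.append(row)
    return table


def utilization(num_patches, num_stages, num_steps=1):
    """Fraction of ticks each stage is busy if consecutive steps run back to back."""
    work = num_patches * num_steps
    return work / (work + num_stages - 1)


schedule = pipeline_schedule(num_patches=4, num_stages=2)
single_step = utilization(4, 2)
many_steps = utilization(4, 2, num_steps=50)
```

For 4 patches and 2 stages the timetable has 5 ticks, with idle slots only while the pipeline fills and drains. If consecutive diffusion steps can be issued back to back, which is what reuse of stale features from the previous step is meant to permit, the fill and drain cost is paid once and utilization approaches 1.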

Performance considerations are summarized as follows:

Domain	Scheduling	Communication	Reported Performance
Fluid/MHD	MPMD, hybrid time	Client–router–server	0.4–0.75× baseline
FEM Coupling	Async MPI-RMA	Interface-only, 1-sided	Faster under imbalance
Sweeps (Sn)	Patch–angle	Directed streams	20–60% (up to 10⁵ cores)
DiT Inference	Patch pipeline	Boundary P2P, reuse	2–4× speedup on PCIe

6. Limitations, Extensions, and Practical Guidelines

While displaced patch approaches offer substantial gains, there are significant considerations:

Increased Iteration Counts: Asynchrony or patch-based scheduling may increase the number of global/local iterations or local solves required for convergence (Kerim et al., 2023).
Implementation Complexity: Correct management of remote memory regions (MPI-RMA), dynamic routing tables, data consistency, and checkpointing increases system complexity (Shiokawa et al., 2017, Kerim et al., 2023, Yan et al., 2018).
Non-Conservative Boundary Schemes: Naive interpolation at patch boundaries may threaten global conservation; conservative mapping or smoothing is required for stringent PDE applications (Shiokawa et al., 2017).
Parameter Selection: Optimal patch size, number of patches, and degree of pipeline parallelism are architecture- and problem-dependent; diminishing returns may result at small patch sizes (PipeFusion) or high communication-to-compute ratios (Fang et al., 2024, Fang et al., 2024).
Extensions: Relaxed interface weights (Aitken, Chebyshev), patch migration for dynamic load balancing, integration with nonlinear solvers, and exploitation of hardware RDMA are proposed for future work (Kerim et al., 2023).
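The cost of asynchrony in iteration counts can be seen in a toy fixed-point coupling between two "patches", each holding one scalar and reading the other's value with a fixed delay. This is only an illustration of the effect, not the algorithm of Kerim et al.: the coefficients, the function name `coupled_solve`, and the fixed delay (standing in for nondeterministic message arrival) are all assumptions.

```python
def coupled_solve(delay, tol=1e-10, max_iter=10_000):
    """Iterate x = a*y + b, y = c*x + d, with each side reading stale neighbour data.

    delay = 0 uses the neighbour's latest value; delay = k uses the value from
    k iterations earlier. Returns (iterations, x, y).
    """
    a, b, c, d = 0.5, 1.0, 0.5, 2.0   # |a * c| < 1, so the coupling is a contraction
    xs, ys = [0.0], [0.0]             # full histories, so stale values can be read
    for k in range(1, max_iter + 1):
        seen = max(0, len(xs) - 1 - delay)  # index of the newest value that has "arrived"
        x_new = a * ys[seen] + b
        y_new = c * xs[seen] + d
        xs.append(x_new)
        ys.append(y_new)
        # Residual of the coupled system evaluated at the latest pair.
        residual = abs(x_new - (a * y_new + b)) + abs(y_new - (c * x_new + d))
        if residual < tol:
            return k, x_new, y_new
    raise RuntimeError("no convergence within max_iter")


iters_fresh, x_fresh, y_fresh = coupled_solve(delay=0)
iters_stale, x_stale, y_stale = coupled_solve(delay=2)
```

Both runs converge to the same fixed point (x = 8/3, y = 10/3), but the delayed run needs more iterations. In a real asynchronous solver each iteration is cheaper in wall time because no rank waits at a barrier, so the trade can still be favourable.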

Guidelines extracted from the literature suggest using at least $40^3$ zones per MPI rank to amortize communication, matching processor counts to patch timestep ratios, smoothing patch motion, and exploiting temporal redundancy when possible (Shiokawa et al., 2017). Patch-level pipeline methods are directly integrated into software stacks such as xDiT and can interoperate with sequence and inter-image parallelism for further scalability (Fang et al., 2024).
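The zones-per-rank guideline follows from a surface-to-volume argument, which a short calculation makes visible. The helper below is a hypothetical illustration that assumes a cubic block of `n` zones per side surrounded by a ghost layer of width `halo`; the actual ghost width depends on the numerical scheme.

```python
def ghost_fraction(n, halo=1):
    """Ratio of ghost zones to owned zones for an n^3 block with a halo on all sides."""
    owned = n ** 3
    with_ghosts = (n + 2 * halo) ** 3
    return (with_ghosts - owned) / owned


small_block = ghost_fraction(10)   # 10^3 zones per rank
large_block = ghost_fraction(40)   # 40^3 zones per rank
```

With a one-zone halo, a $10^3$ block carries about 0.73 ghost zones per owned zone, while a $40^3$ block carries about 0.16, so the data exchanged per unit of computation drops by more than a factor of four.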

Displaced patch parallelism, by enabling modular, high-concurrency workflows with distributed grids or tasks, has become a unifying strategy in both scientific computing and modern large-scale model inference. Its efficacy relies on problem-centric decomposition, minimized and overlapped communication, and dynamic, locality-aware scheduling to harness current and future exascale platforms (Shiokawa et al., 2017, Kerim et al., 2023, Yan et al., 2018, Fang et al., 2024, Fang et al., 2024).