
Asynchronous Hierarchical Zero Parallelism

Updated 24 October 2025
  • AsyncHZP is an advanced distributed optimization and scheduling paradigm that hierarchically shards model states and employs asynchronous communication to reduce memory footprint and network overhead.
  • It leverages adaptive sharding and multi-stream asynchronous scheduling to balance load and minimize synchronization delays during large-scale LLM training and scientific computing.
  • Empirical results indicate up to 25% higher throughput than classic ND parallelism, while the related hierarchical asynchronous method HALoS reports up to 7.5× faster convergence in geo-distributed settings.

Asynchronous Hierarchical Zero Parallelism (AsyncHZP) is an advanced distributed optimization and scheduling paradigm that combines hierarchical model/state sharding with fully asynchronous communication and execution strategies for scalable parallel computing. It is particularly impactful for LLM training and high-performance scientific computing, addressing limitations in memory utilization, load imbalance, and communication bottlenecks across modern clusters and geo-distributed environments.

1. Conceptual Framework

AsyncHZP extends the principle of Zero Redundancy Optimizer (ZeRO) parallelism by introducing a hierarchical sharding and asynchronous scheduling approach. In traditional ZeRO, optimizer states, gradients, and parameters are sharded uniformly across the full data-parallel group, but this practice can incur excessive communication costs, especially at large scale. AsyncHZP redesigns this model by partitioning model states (parameters, gradients, optimizer states) hierarchically across distinct replica groups, assigning independent sharding dimensions $Z_1, Z_2, Z_3$ for each state component. The memory footprint under Hierarchical ZeRO Parallelism (HZP) is thus:

$$M_{hzp} = \frac{12N}{Z_1} + \frac{4N}{Z_2} + \frac{2N}{Z_3}$$

where $N$ is the total model parameter count.
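
The footprint formula translates directly into a capacity check. A minimal Python sketch (the model size and sharding degrees below are illustrative, not taken from the paper):

```python
def hzp_memory_bytes(n: int, z1: int, z2: int, z3: int) -> float:
    """Per-device model-state footprint M_hzp = 12N/Z1 + 4N/Z2 + 2N/Z3.

    N is the parameter count; Z1, Z2, Z3 are the independent sharding
    degrees. The byte coefficients (12, 4, 2) follow the formula above.
    """
    return 12 * n / z1 + 4 * n / z2 + 2 * n / z3

# Illustrative example: a 36B-parameter model with Z1=8, Z2=16, Z3=64.
mem_gib = hzp_memory_bytes(36_000_000_000, 8, 16, 64) / 2**30
print(f"{mem_gib:.1f} GiB of model state per device")  # ~59.7 GiB
```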

Further, AsyncHZP exploits non-blocking multi-stream asynchronous scheduling to overlap data communication (parameter all-gather, gradient reduce-scatter) with computation at all stages, deploying these operations in background threads and using memory pooling to minimize fragmentation and idle time.

This paradigm can be generalized beyond deep learning, as demonstrated by compiler/runtime redesigns that leverage hierarchical async techniques to expose fine-grained parallelism and adaptive load balancing (e.g., OP2 + HPX runtime (Khatami et al., 2017)) and by hierarchical asynchronous local SGD for distributed LLM training (HALoS (2506.04531)), which add server-level scheduling and momentum-based update rules.

2. Hierarchical Sharding and Adaptive Resharding

AsyncHZP’s hierarchical sharding concept enables the adaptive division of parameters, gradients, and optimizer states into multiple groups, with sizes optimized for memory and locality constraints. Rather than fixed global sharding across all devices, model states are sliced according to intra-node and inter-node topology, so that compute-intensive or communication-heavy components are assigned to groups with the fastest available interconnects (e.g., NVLink intra-node, Ethernet/InfiniBand inter-node). This balances device memory utilization against network overhead.
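
As a concrete illustration of topology-aware group formation, the hypothetical helper below builds intra-node and inter-node process groups in PyTorch. It is a sketch rather than the paper's implementation, and assumes `torch.distributed` is already initialized with a node-major rank layout:

```python
import torch.distributed as dist

def build_hierarchical_groups(world_size: int, gpus_per_node: int):
    """Form intra-node groups (ranks sharing NVLink) and inter-node
    groups (peer ranks across the slower fabric) so that aggressively
    sharded state can be gathered over the fastest links.

    Assumes dist.init_process_group() has run and ranks are node-major.
    Every rank must execute the dist.new_group() calls in the same order.
    """
    intra, inter = [], []
    for node in range(world_size // gpus_per_node):
        ranks = list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        intra.append(dist.new_group(ranks=ranks))
    for local in range(gpus_per_node):
        ranks = list(range(local, world_size, gpus_per_node))
        inter.append(dist.new_group(ranks=ranks))
    return intra, inter
```

Components sharded aggressively within a node would then run their collectives over the intra-node groups, while cross-node traffic uses the inter-node groups.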

Key implications:

  • The memory balance is tunable per state component ($Z_1$ for parameters, $Z_2$ for gradients, $Z_3$ for optimizer state), as reflected in the formula above.
  • Group size selection can be adapted to hardware configuration and workload, exploiting high-speed local links for aggressive sharding while limiting fragmentation for large global collectives.
  • The resharding strategy can be leveraged dynamically by runtime methods, as in OP2+HPX, where chunk sizes and task dependencies are optimized at execution time.

3. Asynchronous Scheduling and Task Execution

AsyncHZP proposes a multi-stream asynchronous scheduling system in which communication primitives—parameter all-gather (AG) and gradient reduce-scatter (RS)—are dispatched in background threads concurrent with main computation. This design eliminates classic bottlenecks where computation must wait for synchronization events.

Operational flow:

  • Before computation begins for a layer, all-gather is triggered asynchronously to provide parameters "just in time."
  • During the backward pass, reduce-scatter is issued immediately after gradient calculation, again without blocking the main execution thread.
  • Memory pools of contiguous sub-stream buffers are pre-allocated for communication, allowing cyclic reuse and negligible fragmentation.
  • Synchronous event barriers between computation and communication are avoided, yielding nearly 100% overlap.

A schematic flow (cf. Fig. 2, (Bai et al., 23 Oct 2025)) illustrates concurrent processing of transformer blocks in the main thread and non-blocking communication in auxiliary streams.
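
A minimal PyTorch sketch of this just-in-time overlap follows; it assumes an initialized NCCL process group, and `shard`, `full`, and `grad_shard` stand in for the pre-allocated pooled buffers described above:

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()  # auxiliary stream for collectives

def prefetch_params(shard, full, group):
    """Launch the parameter all-gather on the side stream ahead of the
    layer's forward pass, returning an event for the compute stream."""
    comm_stream.wait_stream(torch.cuda.current_stream())  # shard is ready
    with torch.cuda.stream(comm_stream):
        dist.all_gather_into_tensor(full, shard, group=group)
        done = torch.cuda.Event()
        done.record(comm_stream)
    return done  # consumer: torch.cuda.current_stream().wait_event(done)

def scatter_grads(full_grad, grad_shard, group):
    """Issue reduce-scatter on the side stream right after backward
    produces full_grad, without blocking the main execution thread."""
    comm_stream.wait_stream(torch.cuda.current_stream())  # grads are ready
    with torch.cuda.stream(comm_stream):
        dist.reduce_scatter_tensor(grad_shard, full_grad, group=group)
```

Because the waits are on-device events rather than host-side barriers, the CPU never stalls, which is what allows communication to hide almost entirely behind computation.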

In hierarchical distributed scenarios (HALoS (2506.04531)), task execution is asynchronous at every level: individual workers update local states, local parameter servers aggregate updates, and global parameter server fusion proceeds independently, all without forced synchronization.
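
The essential control flow can be caricatured in a single process. The toy sketch below uses scalars in place of parameter tensors and illustrative values for the window $K$ and momentum; it is a schematic of the hierarchy, not HALoS itself:

```python
import queue
import threading
import time

K = 4  # accumulation window at the local parameter server (illustrative)
local_q = queue.Queue()   # workers -> local parameter server
global_q = queue.Queue()  # local server -> global parameter server

def local_server():
    """Accumulate K worker updates, then forward their average upstream;
    workers never wait on each other or on the global server."""
    buf = []
    while True:
        buf.append(local_q.get())
        if len(buf) == K:
            global_q.put(sum(buf) / K)  # averaging damps update variance
            buf.clear()

def global_server(beta_g=0.9, lr=0.1):
    """Merge local-server updates as they arrive, with global momentum."""
    theta, m = 0.0, 0.0
    while True:
        m = beta_g * m + global_q.get()  # momentum-based update rule
        theta -= lr * m

threading.Thread(target=local_server, daemon=True).start()
threading.Thread(target=global_server, daemon=True).start()
for _ in range(2 * K):   # workers push local updates asynchronously
    local_q.put(1.0)     # stand-in for a gradient-based local update
time.sleep(0.2)          # let the daemon threads drain the queues
```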

4. Runtime Optimizations and Load Balancing

AsyncHZP frameworks incorporate runtime strategies for minimizing load imbalance and maximizing throughput:

  • Dynamic Chunk Sizing: Chunk sizes for compute blocks are set adaptively during execution so that each parallel unit achieves approximately equal wall time (see the sketch after this list), expressed as:

$$\exists\, n_i \text{ such that } T_{chunk} = t(n_i), \quad \forall~\text{dependent loops}$$

  • Loop Interleaving: Dependent tasks can be interleaved when inputs are ready, removing global barriers and reducing idle time.
  • Data Prefetching: Asynchronous prefetch iterators fetch data for future chunks and iterations, minimizing memory latency.
  • Update Accumulation (HALoS): Local parameter servers aggregate updates over a configurable window $K$ before global merging, averaging out update variance.
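
For the dynamic chunk-sizing policy referenced above, a simple proportional controller suffices to illustrate the idea; the sketch below is a generic heuristic, not the OP2+HPX implementation:

```python
import time

def run_in_tuned_chunks(run_chunk, total, target_ms=5.0, n0=64):
    """Execute a loop of `total` items in chunks, resizing each chunk so
    its wall time approaches target_ms, i.e. T_chunk ~= t(n_i).

    run_chunk(start, size) executes one slice of the loop body.
    """
    n, start = n0, 0
    while start < total:
        size = min(n, total - start)
        t0 = time.perf_counter()
        run_chunk(start, size)
        elapsed_ms = (time.perf_counter() - t0) * 1e3
        # Rescale toward the target duration; clamp for stability.
        n = max(1, min(total, int(size * target_ms / max(elapsed_ms, 1e-3))))
        start += size
    return n  # converged chunk size, reusable in later iterations

# Usage (hypothetical): run_in_tuned_chunks(lambda s, k: work(s, k), total=1_000_000)
```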

These optimizations yield improved load balancing and resource utilization, whether in tightly coupled HPC environments or geo-distributed LLM clusters.

5. Theoretical Guarantees and Empirical Performance

AsyncHZP and its hierarchical asynchronous extensions (HALoS) offer theoretical and experimental support for convergence and scalability:

  • Convergence: Under non-convex objectives, hierarchical asynchronous optimization achieves convergence at rates comparable to synchronous SGD, with additional terms quantifying the impact of delay and variance. Example bound (Theorem 4.1, (2506.04531)):

$$\min_{1 \leq t \leq T} \mathbb{E}[\|\nabla F(\Theta_t)\|^2] \leq \frac{4(F(\Theta_0) - F(\Theta_*))}{\eta_m T} \left( 1 + \frac{1}{1-\beta_g} \right) + \frac{\eta_0}{\eta_m}\frac{1}{\beta_g^3} \left(3 + 12L\eta_0 + \frac{6L\eta_0}{(1-\beta_g)^2} \right) \left( \frac{G \sigma^2}{(1-\beta_l)(1-\beta_g)} + L^2 D_g^2 + L^2 D_l^2 \right)$$

where $\beta_g$, $\beta_l$ are momentum coefficients, and the constants represent problem and deployment heterogeneity.

  • Stability: AsyncHZP maintains robust stability at scale under both Dense and Mixture-of-Experts architectures (Seed-OSS 9B/36B, MoE-100B).
  • Performance: AsyncHZP achieves approximately 25% higher throughput compared with classic ND parallelism (Bai et al., 23 Oct 2025), and up to 7.5× faster convergence compared to synchronous baselines in geo-distributed settings (HALoS (2506.04531)), with improvements in Model FLOPs Utilization (MFU) and linear scaling efficiency.
  • Scalability: Tests on clusters of 256–1024 devices indicate improved scaling linearity (91.12% for AsyncHZP vs. 88.37% for the best ND-parallel configurations) and robust throughput as model and hardware scale grow.

6. Practical Implementation and Deployment

AsyncHZP’s hierarchical and asynchronous strategies are matched to a corresponding implementation architecture:

  • API and system modifications: In the OP2+HPX design (Khatami et al., 2017), API primitives are extended to emit futures, kernels become dataflow objects, and persistent execution policies dynamically govern chunk sizing.
  • Multi-stream scheduler: For LLM training, operations are dispatched across main and auxiliary streams; persistent memory pools are configured to manage buffers for communication.
  • Topology-aware group formation: Sharding group sizes $(Z_1, Z_2, Z_3)$ must be tuned for hardware topology and workload, balancing memory usage and communication demands.
  • Compatibility: AsyncHZP integrates directly with ND parallelism methods (context, tensor, pipeline) and can co-exist with recomputation or partitioned training strategies.

Challenges include careful management of background execution and synchronization, tuning group sizes and memory pools for extreme scales, and handling straggler effects when links are heterogeneous. Empirical results indicate these issues are tractable with appropriate system and hyperparameter tuning.
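
Group-size tuning itself can be framed as a small search over the memory formula from Section 1. The brute-force sketch below (candidate sizes and the half-memory activation budget are illustrative assumptions) enumerates feasible $(Z_1, Z_2, Z_3)$ settings, preferring the least aggressive sharding since smaller groups mean cheaper collectives:

```python
from itertools import product

def feasible_shardings(n_params, budget_bytes, sizes=(1, 2, 4, 8, 16, 32, 64)):
    """List (Z1, Z2, Z3) whose footprint 12N/Z1 + 4N/Z2 + 2N/Z3 fits the
    per-device budget, sorted so the least-sharded options come first."""
    out = []
    for z1, z2, z3 in product(sizes, repeat=3):
        mem = 12 * n_params / z1 + 4 * n_params / z2 + 2 * n_params / z3
        if mem <= budget_bytes:
            out.append(((z1, z2, z3), mem))
    return sorted(out, key=lambda item: sum(item[0]))

# Illustrative: 36B parameters on 80 GB devices, half reserved for activations.
for (z1, z2, z3), mem in feasible_shardings(36_000_000_000, 40 * 2**30)[:3]:
    print(f"Z=({z1},{z2},{z3}) -> {mem / 2**30:.1f} GiB model state")
```

In practice, candidates would be filtered further by topology (e.g., requiring the most aggressively sharded component to fit within a node), as described above.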

7. Outlook and Research Directions

AsyncHZP exemplifies a class of hierarchical asynchronous paradigms that can be generalized to emerging hardware (TPUs, new device architectures) and integrated with additional scalings such as pipeline parallelism or recomputation. Research directions include further scheduling and memory optimizations, topological adaptation for new cluster layouts, and extension to systems with even greater heterogeneity.

A plausible implication is that AsyncHZP’s combination of adaptive hierarchical sharding, non-blocking communication, and runtime task optimization will serve as the foundation for scalable model training and simulation in increasingly diverse and globally distributed computational environments.
