ParaDySe: Adaptive Parallel Training Strategy

Updated 24 November 2025
  • ParaDySe is an adaptive parallel strategy switching framework designed to optimize Transformer training by dynamically selecting strategies based on sequence length.
  • It unifies tensor layouts and incorporates hybrid cost models to predict memory and compute time, minimizing OOM issues and communication bottlenecks.
  • Empirical evaluations demonstrate significant training time reductions and increased maximum trainable sequence lengths across various large language models.

ParaDySe is an adaptive parallel-strategy switching framework designed for training Transformer-based LLMs on dynamic sequences with widely varying lengths. Addressing limitations of conventional frameworks that employ static parallelization strategies, ParaDySe enables on-the-fly selection of the optimal strategy for each input sequence, eliminating both the communication overheads that cancel out parallelization gains on short sequences and out-of-memory (OOM) failures on long sequences. ParaDySe unifies multiple leading parallelism schemes under a single tensor-layout specification, implements per-strategy cost models, and provides a per-layer, data-dependent dispatch mechanism—all without requiring PyTorch graph recompilation or tensor redistribution. It demonstrates substantial gains in throughput and maximum trainable sequence length relative to existing static methods when evaluated on large-scale LLMs and long-sequence datasets (Ou et al., 17 Nov 2025).

1. System Architecture and Unified Functional Modular Design

ParaDySe is composed of three main modules: the Switchable Functional Parallelism Module, the Hybrid Cost Estimation Module, and the Adaptive Parallel Strategy Switching Module. Integration is achieved by replacing traditional fixed-strategy Multi-Head Attention (MHA) and Feed-Forward Network (FFN) calls in standard training loops (e.g., Megatron-LM, DeepSpeed) with ParaDySe’s switchable function abstractions.

A unified tensor-layout specification over a 1D device grid ensures compatibility between parallel strategies and avoids information loss during tensor partitioning. Inputs ($X_{\text{MHA}}$, $X_{\text{FFN}}$), parameters ($W_{\text{qkv}}$, $W_{\text{proj}}$, $W_{\text{in}}$, $W_{\text{out}}$), and outputs ($O$, $Z$) are sharded along a single dimension only: batch ($b/p \times s \times h$), sequence ($b \times s/p \times h$), or hidden ($b \times s \times h/p$), with weight matrices supporting row or column partitioning.
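To make the layout concrete, the following minimal sketch (hypothetical names such as `LayoutSpec` and `ShardDim`, not the ParaDySe API) shows how the single-dimension sharding rule determines per-rank activation shapes on a 1D device grid of size $p$:

```python
# Illustrative sketch of the unified 1D tensor-layout specification.
from dataclasses import dataclass
from enum import Enum

class ShardDim(Enum):
    BATCH = 0      # activations sharded as (b/p, s, h)
    SEQUENCE = 1   # activations sharded as (b, s/p, h)
    HIDDEN = 2     # activations sharded as (b, s, h/p)

@dataclass(frozen=True)
class LayoutSpec:
    shard_dim: ShardDim
    p: int  # size of the 1D device grid

    def local_activation_shape(self, b: int, s: int, h: int) -> tuple:
        """Per-rank activation shape under this layout."""
        if self.shard_dim is ShardDim.BATCH:
            return (b // self.p, s, h)
        if self.shard_dim is ShardDim.SEQUENCE:
            return (b, s // self.p, h)
        return (b, s, h // self.p)

# Example: sequence sharding of a (b=4, s=131072, h=8192) batch over 8 GPUs
print(LayoutSpec(ShardDim.SEQUENCE, p=8).local_activation_shape(4, 131072, 8192))
# -> (4, 16384, 8192)
```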

Existing parallel methods—Megatron-LM tensor/pipeline/sequence parallelism (TP/SP/CP), Colossal-AI SP, DeepSpeed Ulysses, METP, and ZeRO3—are subsumed in this layout, standardizing input/output shapes (e.g., input/output as $[b \times (s/p) \times h]$, $W_{\text{qkv}}$ stored as $[(3h/p) \times h]^T$). ParaDySe’s function library exposes parametric calls

$$f_{\pi,\text{MHA}}(X_{\text{MHA}}, W_{\text{qkv}}, W_{\text{proj}}) \to O, \qquad f_{\pi,\text{FFN}}(X_{\text{FFN}}, W_{\text{in}}, W_{\text{out}}) \to Z$$

for each parallel strategy $\pi$, such that switching between strategies is performed without reshuffling or redistribution.
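One plausible way to realize such a function library, sketched below with hypothetical names (`STRATEGY_LIB`, `register_strategy`), is a registry that maps each strategy name to an MHA and an FFN callable sharing the unified input/output layout; this is an illustration of the idea, not the released implementation.

```python
# Sketch of a switchable-function registry: every strategy pi contributes
# f_{pi,MHA}(X, W_qkv, W_proj) -> O and f_{pi,FFN}(X, W_in, W_out) -> Z,
# all consuming and producing tensors in the same unified layout.
from typing import Callable, Dict, Tuple
import torch

MHAFn = Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]
FFNFn = Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]

STRATEGY_LIB: Dict[str, Tuple[MHAFn, FFNFn]] = {}

def register_strategy(name: str, mha_fn: MHAFn, ffn_fn: FFNFn) -> None:
    """Register a strategy whose functions respect the unified tensor layout."""
    STRATEGY_LIB[name] = (mha_fn, ffn_fn)

def get_strategy(name: str) -> Tuple[MHAFn, FFNFn]:
    """Look up the MHA/FFN pair for a strategy; switching is just a new lookup."""
    return STRATEGY_LIB[name]

# e.g. register_strategy("ulysses_zero3", ulysses_mha, ulysses_ffn)
```

Because every registered pair consumes and produces identically laid-out tensors, changing the active strategy never forces a resharding of weights or activations.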

2. Sequence-Aware Hybrid Cost Models

For every parallel strategy $\pi$, ParaDySe constructs sequence-length-aware models of both peak memory $\mathcal{M}_\pi(s)$ and compute time $\mathcal{T}_\pi(s)$ as functions of the current sequence length $s$. For lengths up to the maximal profiled value $s_{\max}$, a Random Forest (RF) regression is used; beyond it, a polynomial fit extrapolates:

$$(\mathcal{T}_\pi, \mathcal{M}_\pi)(s) = \begin{cases} \mathrm{RF}_\pi(s), & s \le s_{\max} \\ \sum_{k=0}^{d} a_k s^k, & s > s_{\max} \end{cases}$$

with the degree $d$ of the polynomial fit determined by the Akaike Information Criterion (AIC). Each RF is fit using a one-hot encoding of both the strategy and the model hyperparameters $(h, n, L)$.

OOM feasibility is captured by storing $\mathrm{OOM}_\pi = \max_s \mathcal{M}_\pi(s)$; strategies for which $\mathcal{M}_\pi(s)$ exceeds device memory are excluded for a given $s$. This hybrid (RF + polynomial) approach enables extrapolation beyond profiled sequence lengths and models both forward and backward wall-clock time and memory requirements.
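A hedged reconstruction of this hybrid model is sketched below: an RF regressor interpolates over profiled lengths, and a polynomial whose degree is chosen by AIC extrapolates beyond $s_{\max}$. For brevity the sketch uses only the sequence length as a feature, whereas the paper's RF also one-hot-encodes the strategy and $(h, n, L)$; all names are illustrative.

```python
# Hybrid RF + AIC-selected-polynomial cost model (simplified reconstruction).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class HybridCostModel:
    def __init__(self, max_degree: int = 4):
        self.rf = RandomForestRegressor(n_estimators=100)
        self.max_degree = max_degree

    def fit(self, seq_lens: np.ndarray, costs: np.ndarray) -> None:
        self.s_max = float(seq_lens.max())
        self.rf.fit(seq_lens.reshape(-1, 1), costs)
        # Choose the polynomial degree d minimizing AIC = 2k + n*ln(RSS/n).
        best_aic, self.poly, n = np.inf, None, len(seq_lens)
        for d in range(1, self.max_degree + 1):
            coeffs = np.polyfit(seq_lens, costs, deg=d)
            rss = float(np.sum((np.polyval(coeffs, seq_lens) - costs) ** 2))
            aic = 2 * (d + 1) + n * np.log(rss / n + 1e-12)
            if aic < best_aic:
                best_aic, self.poly = aic, coeffs

    def predict(self, s: float) -> float:
        if s <= self.s_max:                      # interpolate with the RF
            return float(self.rf.predict(np.array([[s]]))[0])
        return float(np.polyval(self.poly, s))   # extrapolate with the polynomial

# One instance per (strategy, metric), e.g. time and peak memory for "ulysses_zero3",
# each fit on profiled (sequence length, measured cost) samples.
```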

3. Heuristic Parallel Strategy Selection

Given batch size $b$, sequence length $s$, model configuration $(h, n, L)$, and a parallel strategy set $\mathcal{P}$, ParaDySe selects per-layer strategies $\Pi = (\pi_1, \dots, \pi_L)$ that minimize the aggregate training time $\sum_l \mathcal{T}_{\pi_l}(s)$ subject to the total memory constraint $\sum_l \mathcal{M}_{\pi_l}(s) < \mathrm{MemCap}$.

The selection algorithm is as follows:

  1. If $(b, s)$ has been encountered before, retrieve the cached $\Pi^*$.
  2. Order $\mathcal{P}$ by increasing $\mathcal{T}_\pi(s)$, using memory for tie-breaking.
  3. For each ordered prefix:
    • Attempt to fill all layers with the fastest feasible $\pi$; if within memory, select it.
    • Otherwise, substitute the next-best $\pi'$ for individual layers and test feasibility, accumulating valid mixes.
  4. Select the feasible mix with minimal total $\sum_l \mathcal{T}_{\pi_l}$.
  5. If no mix is feasible, revert to the lowest-memory single $\pi$ across all layers.
  6. Apply a smoothing rule: if the switching cost is $<5\%$ of total time, reuse the previous layer’s $\pi$ to avoid oscillatory switching.
  7. Cache the result.

Worst-case time complexity is $O(P^2 L^2)$ per new batch, where $P = |\mathcal{P}|$ is the number of candidate strategies and $L$ the number of layers, with practical acceleration due to caching and early exits.
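The Python sketch below captures the spirit of steps 2–5 under simplifying assumptions (uniform layers, no caching, no smoothing term); `time_of` and `mem_of` stand in for the cost models of Section 2, and all names are hypothetical.

```python
# Simplified greedy selection of per-layer strategies under a memory budget.
def select_strategies(s, L, strategies, time_of, mem_of, mem_cap):
    """Minimize total predicted time while keeping total predicted memory < mem_cap."""
    # Exclude strategies whose single-layer memory already exceeds the budget.
    feasible = [p for p in strategies if mem_of(p, s) < mem_cap]
    if not feasible:
        raise RuntimeError("no feasible strategy for this sequence length")
    # Order by predicted time, breaking ties by predicted memory (step 2).
    feasible.sort(key=lambda p: (time_of(p, s), mem_of(p, s)))
    fastest = feasible[0]
    fallback = min(feasible, key=lambda p: mem_of(p, s))
    # Step 3: start from the fastest strategy everywhere, then downgrade layers
    # one by one to the lowest-memory option until the plan fits in memory.
    plan = [fastest] * L
    for i in range(L):
        if sum(mem_of(p, s) for p in plan) < mem_cap:
            return plan
        plan[i] = fallback
    # Step 5: fall back to the lowest-memory single strategy across all layers.
    return [fallback] * L
```

A real implementation would additionally enumerate intermediate mixes, cache the result per $(b, s)$, and apply the smoothing rule of step 6.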

4. Hot-Switching and Execution Semantics

Strategy switching in ParaDySe is implemented as a dispatch pointer update in the forward pass, with no reinitialization or PyTorch graph recompilation. Each transformer layer’s forward computes

$$\text{current strategy } \pi_l \leftarrow \text{selector}; \qquad \text{call } f_{\pi_l,\,\text{MHA/FFN}}$$

Internally, each $f_{\pi}$ invokes NCCL collectives (AllReduce, AllGather, ReduceScatter) as dictated by the chosen strategy, maintaining invariant input/output tensor layouts. Transitioning between strategies thus requires only changing the function pointer, with no additional synchronization or tensor redistribution. ParaDySe supports seamless, per-layer, per-batch strategy switching, facilitating a 1D state-machine view of layerwise strategy evolution.
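Concretely, the per-layer forward can be pictured as below (a sketch reusing the hypothetical `get_strategy` registry from Section 1, not ParaDySe's actual classes): the layer looks up the current strategy's functions and calls them, so a switch amounts to a different dictionary lookup on the next batch.

```python
import torch.nn as nn

class SwitchableTransformerLayer(nn.Module):
    """Hypothetical layer whose strategy choice is only a dispatch-pointer update."""
    def __init__(self, weights, selector):
        super().__init__()
        self.w = weights          # W_qkv, W_proj, W_in, W_out in the unified layout
        self.selector = selector  # callable returning this layer's strategy name

    def forward(self, x):
        # No reinitialization, recompilation, or tensor redistribution: only the
        # function pointers change when the selector returns a different name.
        mha_fn, ffn_fn = get_strategy(self.selector())
        o = mha_fn(x, self.w["qkv"], self.w["proj"])   # f_{pi_l, MHA}
        return ffn_fn(o, self.w["in"], self.w["out"])  # f_{pi_l, FFN}
```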

5. Integration with Long-Sequence Optimizations

The ParaDySe framework subsumes both conventional high-throughput strategies (e.g., combinations of tensor parallelism and ZeRO3, Ulysses+ZeRO3, ColossalZ) and long-sequence memory-saving methods (e.g., METP) in its strategy set. During training, the cost-based heuristic can automatically combine or alternate between strategy types depending on the observed input sequence length, typically switching to METP for sequences $s \gg 128$K and reverting to sequence-parallel or tensor-parallel schemes for shorter sequences. Expanding the strategy set to include methods such as sparse attention or gradient checkpointing is straightforward, conditional on compliance with the unified tensor specification, as illustrated in the sketch below.
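As a hypothetical illustration of such an extension (not part of ParaDySe), a gradient-checkpointed FFN variant can be registered like any built-in strategy as long as its inputs and outputs keep the unified $[b \times (s/p) \times h]$ layout:

```python
import torch
import torch.utils.checkpoint as ckpt

def ckpt_ffn(x, w_in, w_out):
    # Recompute FFN activations during backward to trade compute for memory;
    # the tensor layout entering and leaving the function is unchanged.
    def ffn(t):
        return torch.relu(t @ w_in) @ w_out
    return ckpt.checkpoint(ffn, x, use_reentrant=False)

# Registration follows the same pattern as the built-in strategies, e.g.:
# register_strategy("megatron_tp_ckpt", megatron_mha, ckpt_ffn)
```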

6. Empirical Evaluation and Performance Analysis

Evaluation is conducted on 8 NVIDIA A100 80 GB GPUs (NVLink), an AMD EPYC 7473X CPU, PyTorch 2.5.1, CUDA 12.4, NCCL 2.21.5, and FlashAttention 2.7.4. Tested models include BERT-Base ($(h, n, L) = (1024, 16, 24)$), LLaMA ($(8192, 64, 80)$), and GPT-3-small ($(12288, 96, 96)$). Datasets encompass GitHubCode (max 309K tokens, 65.7% in (0, 4K), 1.1% $\geq$ 128K) and GRCh38 (max 624K tokens, 1.9% $\geq$ 128K).

Key results using ParaDySe versus static baselines (MegatronTS, MegatronCZ, UlyssesZ, ColossalZ, METP):

  • On GitHubCode with BERT (24 layers), ParaDySe reduces total training time by up to 89.3% at the largest $s$.
  • For GPT, ParaDySe increases the maximum trainable $s$ by 144% (the best baseline hits OOM at $\sim$170K).
  • On GRCh38, BERT achieves a 58% speedup, and GPT extends the maximum $s$ by 181%.
  • LLaMA displays a minor reverse anomaly in baseline performance, attributed to communication overhead, motivating future head-level profiling.

Ablation studies reveal:

  • Removing METP reduces the maximum $s$ from 330K to 112K.
  • Eliminating the RF component restricts the maximum $s$ to $\sim$270K.
  • Dropping MegatronTS increases cumulative training time by 12%.
  • Disabling smoothing incurs a negligible (0.04%) slowdown, but switching patterns become erratic.

ParaDySe’s prediction-driven hot-switching preempts OOM; when OOM is predicted by $\mathcal{M}_\pi(s)$, a lighter feasible strategy is adopted in real time, contrasting with static methods that require costly reinitialization ($\sim$31 s for imports, dataset reload, rebuild, and recompile).

7. Constraints, Limitations, and Prospective Directions

ParaDySe’s memory modeling granularity is at the per-batch level; more precise, operator-granular models could refine the strategy switchpoints for a given $s$. Only 1D device topologies are supported; extending the modular abstraction to 2D/3D tensor- and pipeline-parallel compositions is recognized as a significant opportunity. Presently, the strategy set, while comprehensive, does not include sparse, low-rank, or custom attention mechanisms, but these could be incorporated by adopting the tensor-layout spec. Communication profiling is currently coarse; performance irregularities on LLaMA suggest that finer-grained, possibly per-head, profiling would further optimize switching.

Planned future work includes:

  • Operator-level memory/time cost models
  • Support for heterogeneous device clusters
  • Dynamic pipeline parallelism integration
  • End-to-end autoML-driven learning for switch scheduling

ParaDySe establishes a unified, layer-wise hot-switchable parallel strategy framework for Transformer training, combining cost-model-driven per-batch adaptation with a zero-overhead runtime mechanism, robust to extreme sequence length variation and capable of OOM avoidance and throughput maximization on sequences up to 624K tokens (Ou et al., 17 Nov 2025).
