Tensor & Sequence Parallelism
- Tensor and Sequence Parallelism (TSP) is a unified strategy that shards both model parameters and activations across devices for efficient large-scale training.
- It reduces per-device memory usage and improves throughput by overlapping communication with computation, especially in long-context scenarios.
- TSP integrates tensor and sequence parallelism via advanced block decompositions and hybrid communication schemes, enhancing scalability and hardware utilization.
Tensor and Sequence Parallelism (TSP) refers to system and algorithmic strategies that jointly shard both model parameters (weights) and input activations (tokens, or sequence positions) across parallel computing resources. By integrating tensor parallelism (TP) and sequence parallelism (SP), TSP reduces both parameter-memory and activation-memory per device, thus alleviating hardware bottlenecks for training and inference on large-scale models, especially with long input contexts. Contemporary TSP approaches include “folding” TP and SP onto a single mesh axis, as well as 2D block decompositions (head-context parallelism) and hybrid topologies combining AllToAll and ring-based communication schemes. TSP enables hardware-aware scaling, supports integration with data, pipeline, and expert parallelism, and underpins recent advances in memory-efficient large-model training (Shyam et al., 29 Apr 2026, Gu et al., 2024, Fang et al., 2024, Li et al., 2021, Miller et al., 2020).
1. Principal Techniques and Architectural Dimensions
TSP incorporates and generalizes both tensor parallelism and sequence parallelism. In TP, model weights (e.g., Q/K/V projections, MLP matrices) are partitioned along feature axes so each device holds a weight shard and associated partial activations (Fang et al., 2024, Shyam et al., 29 Apr 2026). In SP, the sequence dimension of activations (tokens) is split across devices; each device processes a contiguous token block with a full model replica (Li et al., 2021, Fang et al., 2024).
TSP “folds” these orthogonal axes onto a single device axis such that each rank holds both a weight and a sequence shard (Shyam et al., 29 Apr 2026). This structure applies to both dense and Mixture-of-Experts (MoE) architectures, unifies memory scaling, and frees devices for further axes such as data or expert parallelism.
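The folded layout can be illustrated with a minimal NumPy sketch. The helper name `tsp_shards`, the toy sizes, and the choice to shard weights by output columns are all assumptions made for illustration; they are not taken from the cited implementations.

```python
import numpy as np

def tsp_shards(rank, degree, seq_len, hidden):
    """Return (token_slice, column_slice) owned by `rank` on the folded axis."""
    if seq_len % degree or hidden % degree:
        raise ValueError("seq_len and hidden must be divisible by degree")
    tok, col = seq_len // degree, hidden // degree
    return (slice(rank * tok, (rank + 1) * tok),
            slice(rank * col, (rank + 1) * col))

D, S, H = 4, 16, 8                      # toy sizes (hypothetical)
X = np.zeros((S, H))                    # activations: [tokens, hidden]
W = np.zeros((H, H))                    # one projection weight: [hidden, hidden]

# Every rank holds BOTH a token block and a weight-column block.
local = [(X[t], W[:, c]) for t, c in (tsp_shards(r, D, S, H) for r in range(D))]
for r, (x_r, w_r) in enumerate(local):
    print(f"rank {r}: activations {x_r.shape}, weights {w_r.shape}")
```

With these sizes each of the four ranks stores a quarter of the tokens and a quarter of the weight columns, so one device axis carries both kinds of sharding.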
Mechanisms for block partitioning include:
- 2D attention blockwise decomposition: partitioning the attention computation along both head (feature) and sequence (context) axes, creating a grid of blocks, with each block assigned to a device or process group (Gu et al., 2024).
- Ring-based attention: implementing distributed self-attention by rotating key/value or activation slices in a ring topology to all devices handling a sequence-shard, guaranteeing complete context aggregation with limited communication per step (Li et al., 2021, Fang et al., 2024).
- “Unified” group factorizations: arranging process groups into multidimensional meshes that flexibly interpolate between AllToAll (Ulysses) and ring-passing (Ring) for optimal hardware utilization and scaling (Fang et al., 2024).
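The ring-based mechanism can be sketched as a single-process NumPy simulation. This is an illustrative, non-causal version that assumes an online-softmax accumulation (in the style of FlashAttention) as each key/value block arrives; the function names are hypothetical and no real communication takes place.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v on the whole sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def ring_attention(q_shards, k_shards, v_shards):
    """Each 'rank' keeps its queries and sees one K/V block per ring step."""
    D = len(q_shards)
    d = q_shards[0].shape[-1]
    outs = []
    for r in range(D):
        q = q_shards[r]
        m = np.full(q.shape[0], -np.inf)                 # running row maximum
        l = np.zeros(q.shape[0])                         # running softmax denominator
        acc = np.zeros((q.shape[0], v_shards[0].shape[1]))  # unnormalized output
        for step in range(D):
            src = (r - step) % D                         # block held after `step` rotations
            s = q @ k_shards[src].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)                    # rescale earlier partial sums
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v_shards[src]
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs, axis=0)
```

After all `D` steps every rank has attended to the complete context while only ever holding one remote key/value block at a time.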
2. Formal Memory and Communication Analysis
TSP achieves a simultaneous reduction in parameter and activation memory by a factor of $1/D$ (where $D$ is the combined TSP group size) over conventional methods (Shyam et al., 29 Apr 2026).
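Table: Per-GPU memory under several parallelism strategies (Shyam et al., 29 Apr 2026)
| Parallelism | Parameter Memory | Activation Memory |
|---|---|---|
| DP | $Pb$ | $A$ |
| TP | $Pb/D$ | $A$ |
| SP | $Pb$ | $A/D$ |
| TP+SP | $Pb/D_{\mathrm{TP}}$ | $A/D_{\mathrm{SP}}$ |
| TSP | $Pb/D$ | $A/D$ |
Here $P$ is the per-layer parameter count, $b$ is bytes per parameter, $A$ is peak activation memory, and $D$ is the TSP degree; for TP+SP, $D_{\mathrm{TP}}$ and $D_{\mathrm{SP}}$ are the tensor- and sequence-parallel degrees of an orthogonal mesh with $D_{\mathrm{TP}} \cdot D_{\mathrm{SP}} = D$.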
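The memory scaling can be made concrete with a small calculator. It is a sketch under an idealized model assumed here: DP replicates everything, TP shards only parameters, SP shards only activations, orthogonal TP+SP splits the device budget between the two, and TSP shards both by the full degree. The function and argument names are hypothetical.

```python
def per_gpu_memory(strategy, P, b, A, D, tp_degree=None, sp_degree=None):
    """Idealized per-GPU (parameter_bytes, activation_bytes) for one layer.

    P: parameter count, b: bytes per parameter, A: peak activation bytes,
    D: total parallel degree (number of devices in the group).
    """
    if strategy == "dp":
        return P * b, A
    if strategy == "tp":
        return P * b / D, A
    if strategy == "sp":
        return P * b, A / D
    if strategy == "tp+sp":
        if tp_degree * sp_degree != D:
            raise ValueError("tp_degree * sp_degree must equal D")
        return P * b / tp_degree, A / sp_degree
    if strategy == "tsp":
        return P * b / D, A / D
    raise ValueError(f"unknown strategy: {strategy}")

# Toy comparison on 8 devices (numbers are illustrative only).
for name, kwargs in [("dp", {}), ("tp", {}), ("sp", {}),
                     ("tp+sp", {"tp_degree": 4, "sp_degree": 2}), ("tsp", {})]:
    print(name, per_gpu_memory(name, P=200e6, b=2, A=8e9, D=8, **kwargs))
```

Under this model only TSP divides both terms by the full group size, which is why it tracks TP when parameters dominate and SP when activations dominate.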
Communication in TSP combines:
- Broadcast or ring-rotate of projection parameter shards (Q/K/V/MLP weights).
- All-gather of key/value activation blocks by sequence-shard for full attention context.
- Aggregation of partial FFN outputs during MLP forward/backward.
Dominant communication term (per layer, per GPU, under selective recompute) (Shyam et al., 29 Apr 2026): the cost is expressed through two coefficients that depend on the QKV and MLP parameter counts, together with the grouped-query ratio, the context length, the hidden dimension, and the microbatch size.
Compared to TP and SP (which have all-reduce and all-gather bottlenecks, respectively), TSP balances compute and communication, allowing both memory and throughput scaling. The activation communication cost for TSP matches that of SP at the same degree, but TSP incurs additional parameter exchanges (which can be overlapped with compute) (Shyam et al., 29 Apr 2026).
3. Algorithmic Schedules and Implementations
Attention Blocks:
- Each rank computes Q/K/V projections with its own local weights and activation shard.
- Parameter shards are broadcast or rotated to all other ranks so each device eventually generates projections for all parameter splits (Shyam et al., 29 Apr 2026).
- All-gather is performed on local K/V projections along the sequence axis to reconstruct the global attention context per shard.
- Blockwise causal attention (e.g., FlashAttention) is applied locally, followed by projection and fusion of output blocks (Gu et al., 2024, Shyam et al., 29 Apr 2026).
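The attention schedule above can be checked with a single-process NumPy simulation. This sketch assumes one attention head per rank, omits the output projection, and replaces collectives with array indexing; all names are hypothetical and it is not the cited implementation.

```python
import numpy as np

def causal_attention(q, k, v, q_pos, k_pos):
    """Softmax attention where a query at position i sees keys at positions <= i."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = np.where(k_pos[None, :] <= q_pos[:, None], s, -np.inf)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tsp_attention(X, Wq, Wk, Wv, D):
    """X: [S, H]; Wq/Wk/Wv: [D, H, dh], head h initially resident on rank h."""
    S, dh = X.shape[0], Wq.shape[-1]
    tok = S // D
    pos = np.arange(S)
    Q = np.empty((D, D, tok, dh))            # [rank, head, local token, dh]
    K, V = np.empty_like(Q), np.empty_like(Q)
    for r in range(D):
        x_local = X[r * tok:(r + 1) * tok]
        for step in range(D):
            h = (r + step) % D               # weight shard resident at this ring step
            Q[r, h] = x_local @ Wq[h]
            K[r, h] = x_local @ Wk[h]
            V[r, h] = x_local @ Wv[h]
    # Simulated all-gather of K/V along the sequence axis: [head, S, dh].
    K_full = np.concatenate([K[r] for r in range(D)], axis=1)
    V_full = np.concatenate([V[r] for r in range(D)], axis=1)
    out = []
    for r in range(D):
        q_pos = pos[r * tok:(r + 1) * tok]
        heads = [causal_attention(Q[r, h], K_full[h], V_full[h], q_pos, pos)
                 for h in range(D)]
        out.append(np.concatenate(heads, axis=-1))
    return np.concatenate(out, axis=0)       # [S, D * dh]

def reference_attention(X, Wq, Wk, Wv):
    """Unsharded multi-head causal attention for comparison."""
    pos = np.arange(X.shape[0])
    heads = [causal_attention(X @ Wq[h], X @ Wk[h], X @ Wv[h], pos, pos)
             for h in range(Wq.shape[0])]
    return np.concatenate(heads, axis=-1)
```

The simulation makes the data flow explicit: weights move between ranks during projection, while only key/value activations are gathered for the attention step itself.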
Gated MLP/FFN Blocks:
- MLP weights are circulated in a ring so that each rank sequentially applies each parameter shard to its local input and accumulates the resulting partial outputs.
- No global all-reduce is needed since outputs are local to each activation shard (Shyam et al., 29 Apr 2026).
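A minimal NumPy sketch of the ring schedule for a gated MLP follows. It assumes a SiLU-gated block whose intermediate dimension is split into `D` shards, and it simulates the ring by indexing into the full weight matrices; the function names are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def reference_gated_mlp(X, Wg, Wu, Wd):
    """Unsharded gated MLP: (silu(X Wg) * (X Wu)) Wd."""
    return (silu(X @ Wg) * (X @ Wu)) @ Wd

def ring_gated_mlp(X, Wg, Wu, Wd, D):
    """X: [S, H]; Wg, Wu: [H, F]; Wd: [F, H]; F is split into D shards."""
    S, F = X.shape[0], Wg.shape[1]
    tok, f = S // D, F // D
    outputs = []
    for r in range(D):
        x_local = X[r * tok:(r + 1) * tok]       # this rank's token block
        y_local = np.zeros((tok, Wd.shape[1]))   # accumulator stays on the rank
        for step in range(D):
            j = (r + step) % D                   # weight shard held at this step
            cols = slice(j * f, (j + 1) * f)
            hidden = silu(x_local @ Wg[:, cols]) * (x_local @ Wu[:, cols])
            y_local += hidden @ Wd[cols, :]      # partial output for local tokens
        outputs.append(y_local)
    return np.concatenate(outputs, axis=0)
```

Because each rank sums the contributions of every weight shard for its own tokens, the accumulated result is already complete for that token block, which is why no cross-rank reduction of outputs is required in this sketch.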
2D and Unified Group Schedules:
- Several systems arrange devices in a 2D or 4D mesh. For example, LoongTrain partitions along both head and sequence axes; Unified SP (USP) creates a mesh of ring and Ulysses groups, interpolating between pure AllToAll and pure Ring strategies (Gu et al., 2024, Fang et al., 2024).
- These layouts allow for optimal mapping to hardware (intra-node links, NIC count), minimize cross-node data movement, and are compatible with ZeRO, pipeline, and data parallelism (Gu et al., 2024, Fang et al., 2024, Li et al., 2021).
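One way to picture the unified layout is as a factorization of the sequence-parallel ranks into AllToAll (Ulysses) groups and ring groups. The sketch below assumes contiguous ranks form Ulysses groups and strided ranks form ring groups; that placement is an illustrative choice, and the helper name `usp_groups` is hypothetical.

```python
def usp_groups(world_size, ulysses_degree):
    """Split ranks into Ulysses (AllToAll) groups and ring groups.

    ulysses_degree == world_size gives pure AllToAll;
    ulysses_degree == 1 gives pure ring passing.
    """
    if world_size % ulysses_degree:
        raise ValueError("world_size must be divisible by ulysses_degree")
    ring_degree = world_size // ulysses_degree
    ulysses = [list(range(i * ulysses_degree, (i + 1) * ulysses_degree))
               for i in range(ring_degree)]
    ring = [list(range(j, world_size, ulysses_degree))
            for j in range(ulysses_degree)]
    return ulysses, ring

ulysses, ring = usp_groups(world_size=8, ulysses_degree=4)
print("Ulysses groups:", ulysses)
print("Ring groups:   ", ring)
```

Every rank belongs to exactly one group of each kind, so changing `ulysses_degree` moves the configuration between the two pure strategies without changing the total device count.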
Overlapping Communication and Compute:
- Implementation best practices use dedicated communication streams for broadcast, all-gather, and ring sends/receives.
- Compute kernels (GEMMs, FlashAttention) are scheduled concurrently with weight or context communication (Shyam et al., 29 Apr 2026, Gu et al., 2024).
- On modern accelerators, this overlap effectively hides much of the parameter-shard and sequence-context communication.
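The benefit of overlap can be estimated with a simple timing model. This sketch assumes a double-buffered ring in which the transfer of the next shard starts when compute on the current shard starts, with uniform per-step compute and transfer times; it is an analytic illustration with hypothetical names, not a measurement.

```python
def ring_step_times(num_shards, t_compute, t_comm, overlap):
    """Total time to process `num_shards` shards with one compute step each."""
    if not overlap:
        # Every transfer sits on the critical path between compute steps.
        return num_shards * t_compute + (num_shards - 1) * t_comm
    start = 0.0
    for _ in range(num_shards - 1):
        # The next step begins once both the current compute and the
        # prefetch of the next shard have finished.
        start += max(t_compute, t_comm)
    return start + t_compute

for t_comm in (1.0, 3.0):
    serial = ring_step_times(4, 2.0, t_comm, overlap=False)
    hidden = ring_step_times(4, 2.0, t_comm, overlap=True)
    print(f"t_comm={t_comm}: serial={serial}, overlapped={hidden}")
```

In this model communication is fully hidden whenever a transfer takes no longer than a compute step; otherwise the pipeline becomes communication-bound but still saves the compute time it overlaps.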
4. Comparative Performance and Scaling
Empirical benchmarks across modern architectures and hardware demonstrate distinct TSP scaling properties (Shyam et al., 29 Apr 2026, Gu et al., 2024, Fang et al., 2024, Li et al., 2021):
- Peak Memory: TSP achieves the lowest per-GPU peak memory across a wide range of sequence lengths, matching TP at short context (small sequence length, where parameter memory dominates) and matching SP as the sequence length grows (where activation memory dominates) (Shyam et al., 29 Apr 2026).
- Throughput: TSP consistently outperforms matched TP+SP factorizations, with the throughput advantage widening at larger device counts, especially on long sequences.
- Batch Size and Sequence Length Scaling: SP and TSP enable training with 13.7× larger batch size and 3.0× longer context than TP (on 64 GPUs for BERT-Base), as well as support for sequences of over 114K tokens using sparse attention kernels (Li et al., 2021).
- Hardware Utilization: LoongTrain improves Model FLOPs Utilization (MFU) by up to 2.88× over DeepSpeed-Ulysses or Megatron-CP for 1M-token LLM training, with near-linear scaling up to 512 GPUs (Gu et al., 2024). Unified SP (USP) attains up to 47% MFU on LLAMA3-8B at 208K context (Fang et al., 2024).
Table: TSP and Baseline MFU for 7B LLMs (Gu et al., 2024)
| Method | Sequence Length | GPUs | Best MFU (%) | Relative Speedup |
|---|---|---|---|---|
| DeepSpeed-Ulysses | 1M | 64 | 36 | 1× |
| Megatron-CP | 1M | 64 | 38 | 1.05× |
| LoongTrain TSP | 1M | 64 | 55 | 1.5× |
This suggests that practical, hardware-aware TSP yields both superior scaling and resource utilization for very long context LLM training versus orthogonal or baseline approaches.
5. Composability with Multidimensional Parallelism
TSP is designed for integration as a mesh axis in “4D” or higher dimensional parallelism frameworks. The canonical mesh ordering is:
- Tensor Parallelism (TP) or TSP at the lowest level (shards model weights/activations)
- Unified Sequence Parallelism (USP) or pure SP
- ZeRO/Data Parallel (DP) for optimizers and gradients
- Pipeline Parallelism (PP) for layer partitioning (Fang et al., 2024, Li et al., 2021, Shyam et al., 29 Apr 2026)
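The ordering above can be sketched as a mapping from a global rank to mesh coordinates. The example assumes a row-major layout in which the tensor (or TSP) axis varies fastest, so that neighbouring ranks share the innermost group; the helper name and mesh sizes are hypothetical.

```python
import numpy as np

def mesh_coords(rank, pp, dp, sp, tp):
    """Map a global rank to (pp, dp, sp, tp) coordinates, tp varying fastest."""
    if not 0 <= rank < pp * dp * sp * tp:
        raise ValueError("rank outside the mesh")
    p, d, s, t = np.unravel_index(rank, (pp, dp, sp, tp))
    return {"pp": int(p), "dp": int(d), "sp": int(s), "tp": int(t)}

# A toy 2 x 2 x 2 x 2 mesh on 16 ranks.
for rank in (0, 1, 2, 4, 8):
    print(rank, mesh_coords(rank, pp=2, dp=2, sp=2, tp=2))
```

With this layout ranks 0 and 1 differ only in the innermost coordinate, which is the axis that would typically be mapped to the fastest intra-node links.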
Key recommendations:
- Use DP first when batch allows; introduce SP or TSP to unlock larger context on limited batch size/hardware.
- Always pair SP/TSP with ZeRO-1/2 for optimizer/gradient memory efficiency. Consider ZeRO-3 if memory-bound.
- TSP frees GPUs (vs. TP+SP mesh) for additional DP/PP axes, maximizing overall hardware usage on long contexts (Shyam et al., 29 Apr 2026, Fang et al., 2024).
USP and Loop-2D (e.g., LoongTrain) further exploit hardware topology by adjusting the AllToAll vs. Ring trade-off and mapping high-bandwidth links to the sequence or tensor axes (Gu et al., 2024, Fang et al., 2024).
6. Extensions and Related Methods
TSP and its generalizations are applicable beyond conventional transformer models. In tensor network-based sequence models, such as the uniform Matrix Product State (u-MPS), both tensor- and sequence-parallel contraction trees allow $O(\log n)$ depth parallelism with $O(n d^3)$ arithmetic, as opposed to $O(n d^2)$ for the sequential contraction, where $n$ is the sequence length and $d$ is the bond dimension. This forms a Pareto frontier between arithmetic intensity and sequence-level parallelism (Miller et al., 2020).
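The two contraction orders can be compared in a short NumPy sketch. It assumes each sequence position has already been mapped to a square transfer matrix, with boundary vectors on either side; the function names are hypothetical. The sequential order uses vector-matrix products, while the tree order uses matrix-matrix products arranged in a balanced binary tree whose depth grows logarithmically with sequence length.

```python
import numpy as np

def sequential_contract(left, mats, right):
    """Left-to-right contraction using only vector-matrix products."""
    v = left
    for M in mats:
        v = v @ M
    return v @ right

def tree_contract(left, mats, right):
    """Pairwise contraction; products within one level are independent."""
    level = list(mats)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] @ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])        # odd element carries over unchanged
        level = nxt
        depth += 1
    return left @ level[0] @ right, depth

rng = np.random.default_rng(0)
n, d = 8, 4
mats = rng.normal(size=(n, d, d)) / np.sqrt(d)
left, right = rng.normal(size=d), rng.normal(size=d)
value, depth = tree_contract(left, mats, right)
print(np.isclose(value, sequential_contract(left, mats, right)), depth)
```

Both orders give the same scalar because matrix multiplication is associative; the tree version trades extra arithmetic per product for fewer dependent steps.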
TSP is also extensible to expert models (MoE), mixtures, and structured sampling, and it can be composed with regular-expression-conditioned sampling and richer conditional generative algorithms (Miller et al., 2020).
7. Limitations and Practical Considerations
- Communication Overhead: TSP introduces weight-movement in each forward pass (broadcast, ring, or all-gather of parameter shards), with an aggregate cost scaling as the sum of activation and weight terms.
- Partitioning and Mesh Constraints: Folding both axes onto a single dimension limits the factorization choices vs. orthogonal TP+SP, but reduces device count per replica and enables denser intra-node placement (Shyam et al., 29 Apr 2026).
- Divisibility Requirements: The sequence length and the hidden dimension must both be divisible by the parallel degree for ideal load balancing (Li et al., 2021). Imbalances introduce non-ideal communication and compute.
- Kernel Support: Efficient TSP depends on kernel fusion (e.g., FlashAttention), communication overlap, and optimized collectives (ROCm/RCCL, NVLink, IB/PCIe hierarchies).
- Best Practice Tuning: Hardware-specific tuning (e.g., Double-Ring concurrency in LoongTrain, All2All group placement in USP) is required for maximum MFU.
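The divisibility requirement can be checked before launch with a small helper. Padding the sequence up to the next multiple of the parallel degree is one common workaround and is an assumption here, not a recommendation from the cited papers; the function names are hypothetical.

```python
def padded_length(seq_len, degree):
    """Smallest multiple of `degree` that is >= seq_len."""
    return -(-seq_len // degree) * degree

def check_shardable(seq_len, hidden, degree):
    """Return a list of human-readable problems; empty if the shapes divide evenly."""
    problems = []
    if seq_len % degree:
        problems.append(
            f"seq_len {seq_len} is not divisible by {degree}; "
            f"pad to {padded_length(seq_len, degree)}")
    if hidden % degree:
        problems.append(f"hidden {hidden} is not divisible by {degree}")
    return problems

print(check_shardable(seq_len=1001, hidden=4096, degree=8))
```

Padding tokens still consume compute and memory, so the padded length is a balance between even sharding and wasted work.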
A plausible implication is that TSP, by subsuming traditional TP and SP into a unified axis, provides a building block for memory-constrained, long-context model training on emerging AI supercomputers (Shyam et al., 29 Apr 2026, Gu et al., 2024, Fang et al., 2024, Li et al., 2021, Miller et al., 2020).