LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism (2406.18485v1)

Published 26 Jun 2024 in cs.DC

Abstract: Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.

The paper introduces LoongTrain, a system designed for efficient training of LLMs with long sequences on large-scale GPU clusters. It addresses limitations in existing sequence parallelism approaches, such as head parallelism and context parallelism, which encounter scalability and communication-efficiency issues, respectively. LoongTrain's core innovation is the 2D-Attention mechanism, which combines head-parallel and context-parallel techniques to overcome scalability constraints while maintaining efficiency.

The limitations of existing sequence parallelism approaches are:

  • Head parallelism's scalability is inherently limited by the number of attention heads.
  • Context parallelism suffers from communication inefficiencies due to peer-to-peer communication, leading to low intra-node and inter-node bandwidth utilization.

LoongTrain proposes a hybrid approach to overcome these limitations.

2D-Attention Mechanism

The 2D-Attention mechanism parallelizes attention across both the head and context dimensions. It distributes the query (Q), key (K), and value (V) tensors across GPUs along the head dimension and partitions them into chunks within the context dimension. The $d_{sp}$ GPUs are organized into a $d_{hp} \times d_{cp}$ grid, where:

$d_{sp} = d_{hp} \times d_{cp}$

  • $d_{hp}$ is the head parallel size
  • $d_{cp}$ is the context parallel size
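
Since head parallelism cannot exceed the number of attention heads, the choice of grid is constrained. The following sketch (illustrative numbers only; 64 GPUs matches the scale used in the evaluation, while 32 heads is an assumed value) enumerates the feasible $(d_{hp}, d_{cp})$ grids for a given sequence-parallel size:

```python
# Sketch: enumerate feasible (d_hp, d_cp) grids for a given sequence-parallel size.
# Assumes d_hp must evenly divide the number of attention heads H (the head-parallel
# limit noted above) and d_hp * d_cp == d_sp. Numbers are illustrative, not from the paper.
def feasible_grids(d_sp: int, num_heads: int):
    grids = []
    for d_hp in range(1, d_sp + 1):
        if d_sp % d_hp == 0 and num_heads % d_hp == 0:
            grids.append((d_hp, d_sp // d_hp))  # (head parallel size, context parallel size)
    return grids

print(feasible_grids(d_sp=64, num_heads=32))
# [(1, 64), (2, 32), (4, 16), (8, 8), (16, 4), (32, 2)]
```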

In multi-head attention, the input tensors $Q$, $K$, and $V$ are divided along the sequence dimension, where each segment is shaped as $(H, S/d_{sp}, D/H)$.

  • $H$ is the number of attention heads
  • $S$ is the sequence length
  • $D$ is the hidden dimension size

The 2D-Attention computation involves three steps:

  1. A SeqAlltoAll communication operation distributes the $Q$, $K$, and $V$ tensors across the $d_{hp}$ GPUs along the head dimension and re-partitions them along the sequence dimension across the $d_{cp}$ GPUs.
  2. Each context parallel group independently performs Double-Ring-Attention, resulting in an output tensor of shape $(H/d_{hp}, S/d_{cp}, D/H)$.
  3. Another SeqAlltoAll operation consolidates the attention outputs across the head dimension and re-partitions the sequence dimension, transforming the output tensor back to $(H, S/d_{sp}, D/H)$.
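
The shape bookkeeping can be checked with a single-process sketch (an illustration of the re-partitioning semantics only, with arbitrary sizes; the real system performs these steps with SeqAlltoAll collectives across GPUs):

```python
import torch

# Illustrative sizes (not from the paper): 4 heads, sequence 16, head dim 8,
# d_hp = 2 head-parallel ranks, d_cp = 2 context-parallel ranks, d_sp = 4.
H, S, Dh = 4, 16, 8
d_hp, d_cp = 2, 2
d_sp = d_hp * d_cp

x = torch.randn(H, S, Dh)  # one "global" activation, laid out as (heads, sequence, head_dim)

# Before 2D-Attention: every rank holds all heads but only S/d_sp tokens.
before = [x[:, r * S // d_sp : (r + 1) * S // d_sp, :] for r in range(d_sp)]
assert all(t.shape == (H, S // d_sp, Dh) for t in before)

# After the first SeqAlltoAll: rank (i, j) in the d_hp x d_cp grid holds
# H/d_hp heads and S/d_cp tokens.
after = {
    (i, j): x[i * H // d_hp : (i + 1) * H // d_hp,
              j * S // d_cp : (j + 1) * S // d_cp, :]
    for i in range(d_hp) for j in range(d_cp)
}
assert all(t.shape == (H // d_hp, S // d_cp, Dh) for t in after.values())

# The second SeqAlltoAll is the inverse re-partition: stitching the head dimension
# back together within context-parallel rank j, then splitting the sequence into
# d_hp pieces of length S/d_sp, recovers the original per-rank layout.
j = 0
stitched = torch.cat([after[(i, j)] for i in range(d_hp)], dim=0)
assert stitched.shape == (H, S // d_cp, Dh)
```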

To address the constraint of limited KV heads in Grouped Query Attention (GQA), LoongTrain uses KV replication. In the forward pass, the input KV tensors are shaped as $(H_{kv}, S/d_{sp}, D/H)$. To align the number of KV heads with the head-parallel size, 2D-Attention replicates the KV tensors, yielding the shape $(\hat{H}_{kv}, S/d_{sp}, D/H)$, where $d_{hp} \leq \hat{H}_{kv} \leq H$.
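
A minimal sketch of the replication step (illustrative sizes and a `repeat_interleave`-based realization of my own; the paper may implement it differently):

```python
import torch

# Illustrative sizes (not from the paper): 2 KV heads, head-parallel size 8,
# a per-rank sequence shard of 4 tokens, head dim 8.
H_kv, d_hp, S_shard, Dh = 2, 8, 4, 8
kv = torch.randn(H_kv, S_shard, Dh)           # (H_kv, S/d_sp, D/H)

# Replicate KV heads until their count matches the head-parallel size,
# so every head-parallel rank can own at least one KV head.
replication = d_hp // H_kv                    # assumes d_hp is a multiple of H_kv
kv_hat = torch.repeat_interleave(kv, replication, dim=0)

print(kv_hat.shape)                           # torch.Size([8, 4, 8]) -> (H_hat_kv, S/d_sp, D/H)
assert d_hp <= kv_hat.shape[0] <= 32          # d_hp <= H_hat_kv <= H (with an assumed H = 32)
```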

Double-Ring-Attention

To fully utilize the available Network Interface Cards (NICs) for inter-node communication, the paper proposes Double-Ring-Attention, which partitions the $d_{cp}$ GPUs into multiple inner rings. The GPUs within each context parallel group form several inner rings, and the inner rings collectively form an outer ring. Assuming each inner ring consists of $w$ GPUs, a context parallel process group has $d_{cp}/w$ concurrent inner rings.
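
The grouping arithmetic can be sketched as follows (illustrative only; the paper's actual ring construction across NICs may differ):

```python
# Sketch: partition the d_cp context-parallel ranks into inner rings of size w.
def build_rings(cp_ranks, w):
    assert len(cp_ranks) % w == 0, "inner ring size must divide the context parallel size"
    inner_rings = [cp_ranks[i : i + w] for i in range(0, len(cp_ranks), w)]
    outer_ring = list(range(len(inner_rings)))  # ordering of the inner rings in the outer ring
    return inner_rings, outer_ring

d_cp, w = 8, 4
inner, outer = build_rings(list(range(d_cp)), w)
print(inner)   # [[0, 1, 2, 3], [4, 5, 6, 7]] -> d_cp / w = 2 concurrent inner rings
print(outer)   # [0, 1]
```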

Device Placement Strategies

The paper discusses two device allocation strategies: head-first placement and context-first placement. Head-first placement prioritizes collocating GPUs of the same head parallel group on the same node, leveraging NVLink for SeqAlltoAll operations. Context-first placement prioritizes collocating GPUs of the same context parallel group on the same node, reducing inter-node traffic during Double-Ring-Attention.
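
The difference can be made concrete with a toy mapping (a sketch under assumed values of 2 nodes with 4 GPUs each, $d_{hp} = 4$, $d_{cp} = 2$; not the paper's exact rank layout):

```python
# Sketch: compare head-first vs. context-first placement on a 2-node, 4-GPU-per-node
# cluster with d_hp = 4 and d_cp = 2. Purely illustrative; the paper's mapping may differ.
d_hp, d_cp, gpus_per_node = 4, 2, 4

def head_first(rank):
    # Consecutive ranks (same node) share a head parallel group.
    return (rank % d_hp, rank // d_hp)        # (hp index, cp index)

def context_first(rank):
    # Consecutive ranks (same node) share a context parallel group.
    return (rank // d_cp, rank % d_cp)        # (hp index, cp index)

for name, place in [("head-first", head_first), ("context-first", context_first)]:
    print(name)
    for node in range(2):
        coords = [place(node * gpus_per_node + g) for g in range(gpus_per_node)]
        print(f"  node {node}: {coords}")
# head-first:    node 0 holds all hp ranks of cp group 0 -> SeqAlltoAll stays on NVLink.
# context-first: node 0 holds both cp ranks of hp groups 0 and 1 -> ring P2P stays intra-node.
```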

Performance Analysis

The paper provides a performance analysis of 2D-Attention, including scalability, computation, peer-to-peer communication, SeqAlltoAll communication, and memory usage. The analysis considers factors such as sequence length, head and context parallelism degrees, inner ring size, Multi-Head Attention (MHA) vs. GQA, and device placement strategies. The goal is to minimize the communication time that cannot be overlapped with computation, which is formulated as:

$\min\; T_{SeqAlltoAll} + \left(T_{inner\_ring}^{fwd} + T_{inner\_ring}^{bwd}\right) \times (d_{cp}/w)$

  • $T_{SeqAlltoAll}$ is the SeqAlltoAll communication time
  • $T_{inner\_ring}^{fwd}$ and $T_{inner\_ring}^{bwd}$ are the forward and backward execution times per inner ring
  • $d_{cp}$ is the context parallel size
  • $w$ is the inner ring size
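
One way to use this objective is to sweep candidate inner-ring sizes and pick the minimizer; the sketch below does so with made-up timing values (placeholders, not measurements from the paper):

```python
# Sketch: evaluate the non-overlapped communication time
#   T_SeqAlltoAll + (T_inner_ring_fwd + T_inner_ring_bwd) * (d_cp / w)
# for candidate inner-ring sizes. All timings are placeholders for illustration.
def non_overlapped_time(t_all2all_ms, t_fwd_ms, t_bwd_ms, d_cp, w):
    return t_all2all_ms + (t_fwd_ms + t_bwd_ms) * (d_cp // w)

d_cp = 8
t_all2all_ms = 3.0
# Hypothetical per-inner-ring forward/backward times for each candidate w.
per_ring_ms = {1: (0.6, 1.2), 2: (0.8, 1.6), 4: (1.8, 3.6), 8: (4.5, 9.0)}

times = {w: non_overlapped_time(t_all2all_ms, f, b, d_cp, w) for w, (f, b) in per_ring_ms.items()}
best_w = min(times, key=times.get)
print(times)            # approximately {1: 17.4, 2: 12.6, 4: 13.8, 8: 16.5}
print("best w:", best_w)  # 2 for these placeholder numbers
```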

End-to-End System Implementation

The paper discusses the end-to-end implementation of LoongTrain, which combines two techniques: hybrid Zero Redundancy Optimizer (ZeRO) and Selective Checkpoint++. The hybrid ZeRO approach shards model states across both the Data Parallelism (DP) and sequence parallelism dimensions, reducing redundant memory usage. Selective Checkpoint++ adds attention modules to a whitelist: during the forward pass, the modified checkpoint function saves the outputs of these modules, and during the backward pass it retrieves the stored outputs and continues the computation graph instead of recomputing attention.
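
For the hybrid ZeRO part, the rank arithmetic can be sketched as follows (my own illustration, assuming a data-parallel-major rank layout; the paper's actual process-group construction may differ): plain ZeRO shards optimizer states across the $d_{dp}$ data-parallel ranks only, while hybrid ZeRO shards them across all $d_{dp} \times d_{sp}$ ranks.

```python
# Sketch: which ranks share one optimizer-state shard group under plain ZeRO vs. hybrid ZeRO.
# Assumes ranks are laid out data-parallel-major: rank = dp_index * d_sp + sp_index.
d_dp, d_sp = 2, 4
world = list(range(d_dp * d_sp))

# Plain ZeRO: shard across data-parallel replicas only (same sp_index, different dp_index).
plain_zero_groups = [[dp * d_sp + sp for dp in range(d_dp)] for sp in range(d_sp)]

# Hybrid ZeRO: shard across both dimensions, i.e. across all d_dp * d_sp ranks,
# so each rank stores a 1/(d_dp * d_sp) slice of the model states.
hybrid_zero_group = world

print(plain_zero_groups)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(hybrid_zero_group)   # [0, 1, 2, 3, 4, 5, 6, 7]
print("shard fraction: plain 1/%d vs hybrid 1/%d" % (d_dp, d_dp * d_sp))
```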

Evaluation Results

The paper presents experimental results comparing LoongTrain with DeepSpeed-Ulysses and Megatron Context Parallelism. The results demonstrate that LoongTrain outperforms these baselines in both end-to-end training speed and scalability, improving Model FLOPs Utilization (MFU) by up to 2.88x. The evaluation includes training 7B-MHA and 7B-GQA models on 64 GPUs with various sequence lengths and configurations, and highlights the benefits of 2D-Attention, Double-Ring-Attention, and Selective Checkpoint++.

Authors (14)
  1. Diandian Gu (5 papers)
  2. Peng Sun (210 papers)
  3. Qinghao Hu (31 papers)
  4. Ting Huang (26 papers)
  5. Xun Chen (166 papers)
  6. YingTong Xiong (5 papers)
  7. Guoteng Wang (6 papers)
  8. Qiaoling Chen (14 papers)
  9. Shangchun Zhao (4 papers)
  10. Jiarui Fang (16 papers)
  11. Yonggang Wen (84 papers)
  12. Tianwei Zhang (199 papers)
  13. Xin Jin (285 papers)
  14. Xuanzhe Liu (59 papers)