LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism (2406.18485v1)

Published 26 Jun 2024 in cs.DC

Abstract: Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.

The paper introduces LoongTrain, a system designed for efficient training of LLMs with long sequences on large-scale GPU clusters. It addresses limitations in existing sequence parallelism approaches, such as head parallelism and context parallelism, which encounter scalability and communication-efficiency issues, respectively. LoongTrain's core innovation is the 2D-Attention mechanism, which combines head-parallel and context-parallel techniques to overcome scalability constraints while maintaining efficiency.

The limitations of existing sequence parallelism approaches are:

  • Head parallelism's scalability is inherently limited by the number of attention heads.
  • Context parallelism suffers from communication inefficiencies due to peer-to-peer communication, leading to low intra-node and inter-node bandwidth utilization.

LoongTrain proposes a hybrid approach to overcome these limitations.

2D-Attention Mechanism

The 2D-Attention mechanism parallelizes attention across both the head and context dimensions. It distributes the query (Q), key (K), and value (V) tensors across GPUs along the head dimension and partitions them into chunks within the context dimension. The $d_{sp}$ GPUs are organized into a $d_{hp} \times d_{cp}$ grid, where:

$d_{sp} = d_{hp} \times d_{cp}$

  • $d_{hp}$ is the head parallel size
  • $d_{cp}$ is the context parallel size
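
Since head parallelism cannot exceed the number of attention heads, the choice of grid is constrained. The following sketch (illustrative numbers only; 64 GPUs matches the scale used in the evaluation, while 32 heads is an assumed value) enumerates the feasible $(d_{hp}, d_{cp})$ grids for a given sequence-parallel size:

```python
# Sketch: enumerate feasible (d_hp, d_cp) grids for a given sequence-parallel size.
# Assumes d_hp must evenly divide the number of attention heads H (the head-parallel
# limit noted above) and d_hp * d_cp == d_sp. Numbers are illustrative, not from the paper.
def feasible_grids(d_sp: int, num_heads: int):
    grids = []
    for d_hp in range(1, d_sp + 1):
        if d_sp % d_hp == 0 and num_heads % d_hp == 0:
            grids.append((d_hp, d_sp // d_hp))  # (head parallel size, context parallel size)
    return grids

print(feasible_grids(d_sp=64, num_heads=32))
# [(1, 64), (2, 32), (4, 16), (8, 8), (16, 4), (32, 2)]
```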

In multi-head attention, the input tensors $Q$, $K$, and $V$ are divided along the sequence dimension, where each segment is shaped as $(H, S/d_{sp}, D/H)$.

  • $H$ is the number of attention heads
  • $S$ is the sequence length
  • $D$ is the hidden dimension size

The 2D-Attention computation involves three steps:

  1. A SeqAlltoAll communication operation distributes the $Q$, $K$, and $V$ tensors across the $d_{hp}$ GPUs along the head dimension and re-partitions them along the sequence dimension across the $d_{cp}$ GPUs.
  2. Each context parallel group independently performs Double-Ring-Attention, resulting in an output tensor of shape $(H/d_{hp}, S/d_{cp}, D/H)$.
  3. Another SeqAlltoAll operation consolidates the attention outputs across the head dimension and re-partitions the sequence dimension, transforming the output tensor back to $(H, S/d_{sp}, D/H)$.
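
The shape bookkeeping can be checked with a single-process sketch (an illustration of the re-partitioning semantics only, with arbitrary sizes; the real system performs these steps with SeqAlltoAll collectives across GPUs):

```python
import torch

# Illustrative sizes (not from the paper): 4 heads, sequence 16, head dim 8,
# d_hp = 2 head-parallel ranks, d_cp = 2 context-parallel ranks, d_sp = 4.
H, S, Dh = 4, 16, 8
d_hp, d_cp = 2, 2
d_sp = d_hp * d_cp

x = torch.randn(H, S, Dh)  # one "global" activation, laid out as (heads, sequence, head_dim)

# Before 2D-Attention: every rank holds all heads but only S/d_sp tokens.
before = [x[:, r * S // d_sp : (r + 1) * S // d_sp, :] for r in range(d_sp)]
assert all(t.shape == (H, S // d_sp, Dh) for t in before)

# After the first SeqAlltoAll: rank (i, j) in the d_hp x d_cp grid holds
# H/d_hp heads and S/d_cp tokens.
after = {
    (i, j): x[i * H // d_hp : (i + 1) * H // d_hp,
              j * S // d_cp : (j + 1) * S // d_cp, :]
    for i in range(d_hp) for j in range(d_cp)
}
assert all(t.shape == (H // d_hp, S // d_cp, Dh) for t in after.values())

# The second SeqAlltoAll is the inverse re-partition: stitching the head dimension
# back together within context-parallel rank j, then splitting the sequence into
# d_hp pieces of length S/d_sp, recovers the original per-rank layout.
j = 0
stitched = torch.cat([after[(i, j)] for i in range(d_hp)], dim=0)
assert stitched.shape == (H, S // d_cp, Dh)
```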

To address the constraint of limited KV heads in Grouped Query Attention (GQA), LoongTrain uses KV replication. In the forward pass, the input KV tensors are shaped as $(H_{kv}, S/d_{sp}, D/H)$. To align the number of KV heads with the head-parallel size, 2D-Attention replicates the KV tensors, yielding the shape $(\hat{H}_{kv}, S/d_{sp}, D/H)$, where $d_{hp} \leq \hat{H}_{kv} \leq H$.
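
A minimal sketch of the replication step (illustrative sizes and a `repeat_interleave`-based realization of my own; the paper may implement it differently):

```python
import torch

# Illustrative sizes (not from the paper): 2 KV heads, head-parallel size 8,
# a per-rank sequence shard of 4 tokens, head dim 8.
H_kv, d_hp, S_shard, Dh = 2, 8, 4, 8
kv = torch.randn(H_kv, S_shard, Dh)           # (H_kv, S/d_sp, D/H)

# Replicate KV heads until their count matches the head-parallel size,
# so every head-parallel rank can own at least one KV head.
replication = d_hp // H_kv                    # assumes d_hp is a multiple of H_kv
kv_hat = torch.repeat_interleave(kv, replication, dim=0)

print(kv_hat.shape)                           # torch.Size([8, 4, 8]) -> (H_hat_kv, S/d_sp, D/H)
assert d_hp <= kv_hat.shape[0] <= 32          # d_hp <= H_hat_kv <= H (with an assumed H = 32)
```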

Double-Ring-Attention

To fully utilize the available Network Interface Cards (NICs) for inter-node communication, the paper proposes Double-Ring-Attention, which partitions the $d_{cp}$ GPUs into multiple inner rings. The GPUs within each context parallel group form several inner rings, and the inner rings collectively form an outer ring. Assuming each inner ring consists of $w$ GPUs, a context parallel process group has $d_{cp}/w$ concurrent inner rings.
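
The grouping arithmetic can be sketched as follows (illustrative only; the paper's actual ring construction across NICs may differ):

```python
# Sketch: partition the d_cp context-parallel ranks into inner rings of size w.
def build_rings(cp_ranks, w):
    assert len(cp_ranks) % w == 0, "inner ring size must divide the context parallel size"
    inner_rings = [cp_ranks[i : i + w] for i in range(0, len(cp_ranks), w)]
    outer_ring = list(range(len(inner_rings)))  # ordering of the inner rings in the outer ring
    return inner_rings, outer_ring

d_cp, w = 8, 4
inner, outer = build_rings(list(range(d_cp)), w)
print(inner)   # [[0, 1, 2, 3], [4, 5, 6, 7]] -> d_cp / w = 2 concurrent inner rings
print(outer)   # [0, 1]
```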

Device Placement Strategies

The paper discusses two device allocation strategies: head-first placement and context-first placement. Head-first placement prioritizes collocating GPUs of the same head parallel group on the same node, leveraging NVLink for SeqAlltoAll operations. Context-first placement prioritizes collocating GPUs of the same context parallel group on the same node, reducing inter-node traffic during Double-Ring-Attention.
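
The difference can be made concrete with a toy mapping (a sketch under assumed values of 2 nodes with 4 GPUs each, $d_{hp} = 4$, $d_{cp} = 2$; not the paper's exact rank layout):

```python
# Sketch: compare head-first vs. context-first placement on a 2-node, 4-GPU-per-node
# cluster with d_hp = 4 and d_cp = 2. Purely illustrative; the paper's mapping may differ.
d_hp, d_cp, gpus_per_node = 4, 2, 4

def head_first(rank):
    # Consecutive ranks (same node) share a head parallel group.
    return (rank % d_hp, rank // d_hp)        # (hp index, cp index)

def context_first(rank):
    # Consecutive ranks (same node) share a context parallel group.
    return (rank // d_cp, rank % d_cp)        # (hp index, cp index)

for name, place in [("head-first", head_first), ("context-first", context_first)]:
    print(name)
    for node in range(2):
        coords = [place(node * gpus_per_node + g) for g in range(gpus_per_node)]
        print(f"  node {node}: {coords}")
# head-first:    node 0 holds all hp ranks of cp group 0 -> SeqAlltoAll stays on NVLink.
# context-first: node 0 holds both cp ranks of hp groups 0 and 1 -> ring P2P stays intra-node.
```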

Performance Analysis

The paper provides a performance analysis of 2D-Attention, including scalability, computation, peer-to-peer communication, SeqAlltoAll communication, and memory usage. The analysis considers factors such as sequence length, head and context parallelism degrees, inner ring size, Multi-Head Attention (MHA) vs. GQA, and device placement strategies. The goal is to minimize the communication time that cannot be overlapped with computation, which is formulated as:

$\min\; T_{SeqAlltoAll} + \left(T_{inner\_ring}^{fwd} + T_{inner\_ring}^{bwd}\right) \times (d_{cp}/w)$

  • $T_{SeqAlltoAll}$ is the SeqAlltoAll communication time
  • $T_{inner\_ring}^{fwd}$ and $T_{inner\_ring}^{bwd}$ are the forward and backward execution times per inner ring
  • $d_{cp}$ is the context parallel size
  • $w$ is the inner ring size
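
One way to use this objective is to sweep candidate inner-ring sizes and pick the minimizer; the sketch below does so with made-up timing values (placeholders, not measurements from the paper):

```python
# Sketch: evaluate the non-overlapped communication time
#   T_SeqAlltoAll + (T_inner_ring_fwd + T_inner_ring_bwd) * (d_cp / w)
# for candidate inner-ring sizes. All timings are placeholders for illustration.
def non_overlapped_time(t_all2all_ms, t_fwd_ms, t_bwd_ms, d_cp, w):
    return t_all2all_ms + (t_fwd_ms + t_bwd_ms) * (d_cp // w)

d_cp = 8
t_all2all_ms = 3.0
# Hypothetical per-inner-ring forward/backward times for each candidate w.
per_ring_ms = {1: (0.6, 1.2), 2: (0.8, 1.6), 4: (1.8, 3.6), 8: (4.5, 9.0)}

times = {w: non_overlapped_time(t_all2all_ms, f, b, d_cp, w) for w, (f, b) in per_ring_ms.items()}
best_w = min(times, key=times.get)
print(times)            # approximately {1: 17.4, 2: 12.6, 4: 13.8, 8: 16.5}
print("best w:", best_w)  # 2 for these placeholder numbers
```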

End-to-End System Implementation

The paper discusses the end-to-end implementation of LoongTrain, which combines two techniques: hybrid Zero Redundancy Optimizer (ZeRO) and Selective Checkpoint++. The hybrid ZeRO approach shards model states across both the Data Parallelism (DP) and sequence parallelism dimensions, reducing redundant memory usage. Selective Checkpoint++ adds attention modules to a whitelist: during the forward pass, the modified checkpoint function saves the outputs of these modules, and during the backward pass it retrieves the stored outputs and continues the computation graph instead of recomputing attention.
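
For the hybrid ZeRO part, the rank arithmetic can be sketched as follows (my own illustration, assuming a data-parallel-major rank layout; the paper's actual process-group construction may differ): plain ZeRO shards optimizer states across the $d_{dp}$ data-parallel ranks only, while hybrid ZeRO shards them across all $d_{dp} \times d_{sp}$ ranks.

```python
# Sketch: which ranks share one optimizer-state shard group under plain ZeRO vs. hybrid ZeRO.
# Assumes ranks are laid out data-parallel-major: rank = dp_index * d_sp + sp_index.
d_dp, d_sp = 2, 4
world = list(range(d_dp * d_sp))

# Plain ZeRO: shard across data-parallel replicas only (same sp_index, different dp_index).
plain_zero_groups = [[dp * d_sp + sp for dp in range(d_dp)] for sp in range(d_sp)]

# Hybrid ZeRO: shard across both dimensions, i.e. across all d_dp * d_sp ranks,
# so each rank stores a 1/(d_dp * d_sp) slice of the model states.
hybrid_zero_group = world

print(plain_zero_groups)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(hybrid_zero_group)   # [0, 1, 2, 3, 4, 5, 6, 7]
print("shard fraction: plain 1/%d vs hybrid 1/%d" % (d_dp, d_dp * d_sp))
```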

Evaluation Results

The paper presents experimental results comparing LoongTrain with DeepSpeed-Ulysses and Megatron Context Parallelism. The results demonstrate that LoongTrain outperforms these baselines in both end-to-end training speed and scalability, improving Model FLOPs Utilization (MFU) by up to 2.88x. The evaluation includes training 7B-MHA and 7B-GQA models on 64 GPUs with various sequence lengths and configurations, and highlights the benefits of 2D-Attention, Double-Ring-Attention, and Selective Checkpoint++.

Authors (14)
  1. Diandian Gu (5 papers)
  2. Peng Sun (210 papers)
  3. Qinghao Hu (31 papers)
  4. Ting Huang (26 papers)
  5. Xun Chen (166 papers)
  6. YingTong Xiong (5 papers)
  7. Guoteng Wang (6 papers)
  8. Qiaoling Chen (14 papers)
  9. Shangchun Zhao (4 papers)
  10. Jiarui Fang (16 papers)
  11. Yonggang Wen (84 papers)
  12. Tianwei Zhang (199 papers)
  13. Xin Jin (285 papers)
  14. Xuanzhe Liu (59 papers)