Sequence Parallelism: Enabling Long Sequence Training for Transformers
The paper "Sequence Parallelism: Long Sequence Training from System Perspective" addresses a notable bottleneck in training transformer models, which is the quadratic memory requirement of self-attention mechanisms concerning input sequence lengths. The authors introduce a novel parallelism technique, termed sequence parallelism, to effectively train transformers with longer sequences by distributing chunks of the sequence across multiple devices (e.g., GPUs). This approach is distinct from other strategies that typically attempt to mitigate memory issues from an algorithmic standpoint, focusing instead on a system-level solution.
Approach and Contributions
Transformer models have become ubiquitous in NLP and have shown impressive results across multiple domains, including computer vision and bioinformatics. However, their scalability is often limited by the memory-intensive self-attention mechanism, which makes training on long sequences difficult. Traditional methods such as sparse attention or attention approximation address this by reducing the complexity of attention itself; the authors instead take a system perspective, proposing sequence parallelism to distribute the memory burden across devices.
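To make the bottleneck concrete, consider a single self-attention layer (the notation below is illustrative and not taken from the paper): with batch size B, H heads, sequence length L, and per-head dimension d, the attention score matrix alone grows quadratically in L, whereas sequence parallelism lets each of N devices hold only an L/N-token chunk of the sequence.

```latex
S = \frac{QK^{\top}}{\sqrt{d}} \in \mathbb{R}^{B \times H \times L \times L}
\quad\Rightarrow\quad
\mathrm{Mem}(S) = O(B\,H\,L^{2}),
\qquad
\text{per-device sequence length under sequence parallelism: } \frac{L}{N}.
```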
Key Contributions:
- Sequence Parallelism Implementation: The authors design a scheme in which the input sequence is split into multiple smaller chunks, each assigned to a separate device. Each device processes only its own part of the sequence, so no single device needs to store the entire sequence, which significantly reduces per-device memory usage (a minimal splitting sketch appears after this list).
- Ring Self-Attention (RSA): To compute attention efficiently across devices, the authors propose Ring Self-Attention. RSA circulates key and value embeddings around the ring of GPUs so that each device eventually computes attention between its local queries and the full sequence (see the simulation sketch after this list).
- Compatibility and Scalability: Sequence parallelism is shown to be compatible with existing data, pipeline, and tensor parallelism, enabling a proposed conceptual 4D parallelism. This compatibility opens the door to scaling models along an additional dimension.
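As a concrete illustration of the splitting described in the first bullet, the snippet below is a minimal PyTorch sketch; the tensor shapes, device count, and device mapping are assumptions for illustration, not values from the paper.

```python
import torch

# Minimal sketch of sequence-dimension splitting (shapes are illustrative).
x = torch.randn(4, 1024, 768)                 # [batch, seq_len, hidden]
num_devices = 8                               # ring size N
chunks = torch.chunk(x, num_devices, dim=1)   # eight [4, 128, 768] chunks

# In an actual multi-GPU run, each chunk would live on a different device,
# e.g. chunks[i].to(f"cuda:{i}"); here the chunks simply stay on the host.
print(chunks[0].shape)                        # torch.Size([4, 128, 768])
```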
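The ring exchange from the second bullet can be illustrated with a single-process simulation. The sketch below is not the authors' implementation: list indices stand in for GPU ranks, plain list rotations stand in for ring send/receive operations, and the function name and shapes are assumptions. It does capture the core idea of RSA: key and value chunks rotate around the ring while each rank's queries stay put, so every rank computes attention over the full sequence without ever holding all keys and values at once.

```python
import torch

def ring_self_attention(q_chunks, k_chunks, v_chunks):
    """Single-process simulation of Ring Self-Attention (RSA).

    Each list entry plays the role of one device holding a contiguous
    sequence chunk of shape [chunk_len, d]. Key/value chunks are rotated
    around the "ring" so every rank attends over the full sequence while
    holding only one remote chunk at a time.
    """
    n = len(q_chunks)                          # ring size (number of devices)
    d = q_chunks[0].shape[-1]
    chunk_len = k_chunks[0].shape[0]

    # Stage 1: rotate key chunks; each rank accumulates its score blocks.
    score_blocks = [[None] * n for _ in range(n)]
    k_ring = list(k_chunks)
    for step in range(n):
        for rank in range(n):
            owner = (rank - step) % n          # whose keys this rank holds now
            score_blocks[rank][owner] = q_chunks[rank] @ k_ring[rank].T / d ** 0.5
        # "Send" each key chunk to the next rank in the ring.
        k_ring = [k_ring[(r - 1) % n] for r in range(n)]

    # Softmax over the reassembled score row for each rank's local queries.
    weights = [torch.softmax(torch.cat(blocks, dim=-1), dim=-1)
               for blocks in score_blocks]

    # Stage 2: rotate value chunks; each rank accumulates its output.
    outputs = [torch.zeros_like(q) for q in q_chunks]
    v_ring = list(v_chunks)
    for step in range(n):
        for rank in range(n):
            owner = (rank - step) % n
            w = weights[rank][:, owner * chunk_len:(owner + 1) * chunk_len]
            outputs[rank] = outputs[rank] + w @ v_ring[rank]
        v_ring = [v_ring[(r - 1) % n] for r in range(n)]
    return outputs

# Check against ordinary full-sequence self-attention (no masking).
torch.manual_seed(0)
chunks = [torch.randn(2, 16) for _ in range(4)]       # seq_len 8 over 4 "devices"
full = torch.cat(chunks)
reference = torch.softmax(full @ full.T / 16 ** 0.5, dim=-1) @ full
assert torch.allclose(torch.cat(ring_self_attention(chunks, chunks, chunks)),
                      reference, atol=1e-5)
```

In a real distributed run, the list rotation would be replaced by point-to-point sends and receives between neighboring GPUs rather than in-memory copies.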
Experimental Findings
The authors conducted experiments with BERT models to validate the proposed system, with results showing substantial gains in both maximum sequence length and maximum batch size. Noteworthy findings include:
- On a cluster of 64 NVIDIA P100 GPUs, sequence parallelism supports up to 3x longer maximum sequence lengths and over 13x larger maximum batch sizes than traditional tensor parallelism.
- Combined with sparse attention, sequence parallelism handles sequences of over 114K tokens, more than 27 times longer than existing sparse attention methods that keep the whole sequence on a single device.
- Memory efficiency improves significantly under sequence parallelism, supporting scalability in real deployment scenarios, especially in domains that require processing very long sequences, such as medical imaging.
Implications and Future Work
The implications of sequence parallelism are substantial, particularly in democratizing access to training transformers on long sequences without extraordinary hardware resources. Practically, it makes applications that demand long-range sequence processing, such as detailed image analysis or whole-document understanding, more feasible. Theoretically, it challenges the memory limits traditionally assumed for self-attentive architectures and points to new directions for distributed deep learning frameworks.
Future directions discussed by the authors include realizing 4D parallelism by integrating sequence parallelism with data, pipeline, and tensor parallelism. Such an integration could expand training capability along several dimensions at once (model size, batch size, and sequence length), ultimately contributing to the development of extraordinarily large-scale and efficient models.
In summary, this paper presents a robust and scalable approach to overcoming the sequence length limitations of self-attention models, with substantial implications for both the theoretical underpinnings and the practical deployment of transformer-based AI systems.