Sequence Parallelism: Enabling Long Sequence Training for Transformers
The paper "Sequence Parallelism: Long Sequence Training from System Perspective" addresses a notable bottleneck in training transformer models, which is the quadratic memory requirement of self-attention mechanisms concerning input sequence lengths. The authors introduce a novel parallelism technique, termed sequence parallelism, to effectively train transformers with longer sequences by distributing chunks of the sequence across multiple devices (e.g., GPUs). This approach is distinct from other strategies that typically attempt to mitigate memory issues from an algorithmic standpoint, focusing instead on a system-level solution.
Approach and Contributions
Transformer models have become ubiquitous in NLP and have shown impressive results across multiple domains, including computer vision and bioinformatics. However, their scalability is often limited by the memory-intensive self-attention mechanism, which makes training on long sequences difficult. Traditional methods such as sparse attention or attention approximation address this by reducing the complexity of attention itself; the authors instead take a system perspective, proposing sequence parallelism to distribute the memory burden across devices.
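To make the bottleneck concrete, consider a single self-attention layer (the notation below is illustrative and not taken from the paper): with batch size B, H heads, sequence length L, and per-head dimension d, the attention score matrix alone grows quadratically in L, whereas sequence parallelism lets each of N devices hold only an L/N-token chunk of the sequence.

```latex
S = \frac{QK^{\top}}{\sqrt{d}} \in \mathbb{R}^{B \times H \times L \times L}
\quad\Rightarrow\quad
\mathrm{Mem}(S) = O(B\,H\,L^{2}),
\qquad
\text{per-device sequence length under sequence parallelism: } \frac{L}{N}.
```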
Key Contributions:
- Sequence Parallelism Implementation: The authors design a scheme in which the input sequence is split into multiple smaller chunks, each assigned to a separate device. Each device processes only its own part of the sequence, so no single device needs to store the entire sequence, which significantly reduces per-device memory usage (a minimal splitting sketch appears after this list).
- Ring Self-Attention (RSA): To compute attention efficiently across devices, the authors propose Ring Self-Attention. RSA circulates key and value embeddings around the ring of GPUs so that each device eventually computes attention between its local queries and the full sequence (see the simulation sketch after this list).
- Compatibility and Scalability: Sequence parallelism is shown to be compatible with existing data, pipeline, and tensor parallelism, enabling a proposed conceptual 4D parallelism. This compatibility opens the door to scaling models along an additional dimension.
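As a concrete illustration of the splitting described in the first bullet, the snippet below is a minimal PyTorch sketch; the tensor shapes, device count, and device mapping are assumptions for illustration, not values from the paper.

```python
import torch

# Minimal sketch of sequence-dimension splitting (shapes are illustrative).
x = torch.randn(4, 1024, 768)                 # [batch, seq_len, hidden]
num_devices = 8                               # ring size N
chunks = torch.chunk(x, num_devices, dim=1)   # eight [4, 128, 768] chunks

# In an actual multi-GPU run, each chunk would live on a different device,
# e.g. chunks[i].to(f"cuda:{i}"); here the chunks simply stay on the host.
print(chunks[0].shape)                        # torch.Size([4, 128, 768])
```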
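The ring exchange from the second bullet can be illustrated with a single-process simulation. The sketch below is not the authors' implementation: list indices stand in for GPU ranks, plain list rotations stand in for ring send/receive operations, and the function name and shapes are assumptions. It does capture the core idea of RSA: key and value chunks rotate around the ring while each rank's queries stay put, so every rank computes attention over the full sequence without ever holding all keys and values at once.

```python
import torch

def ring_self_attention(q_chunks, k_chunks, v_chunks):
    """Single-process simulation of Ring Self-Attention (RSA).

    Each list entry plays the role of one device holding a contiguous
    sequence chunk of shape [chunk_len, d]. Key/value chunks are rotated
    around the "ring" so every rank attends over the full sequence while
    holding only one remote chunk at a time.
    """
    n = len(q_chunks)                          # ring size (number of devices)
    d = q_chunks[0].shape[-1]
    chunk_len = k_chunks[0].shape[0]

    # Stage 1: rotate key chunks; each rank accumulates its score blocks.
    score_blocks = [[None] * n for _ in range(n)]
    k_ring = list(k_chunks)
    for step in range(n):
        for rank in range(n):
            owner = (rank - step) % n          # whose keys this rank holds now
            score_blocks[rank][owner] = q_chunks[rank] @ k_ring[rank].T / d ** 0.5
        # "Send" each key chunk to the next rank in the ring.
        k_ring = [k_ring[(r - 1) % n] for r in range(n)]

    # Softmax over the reassembled score row for each rank's local queries.
    weights = [torch.softmax(torch.cat(blocks, dim=-1), dim=-1)
               for blocks in score_blocks]

    # Stage 2: rotate value chunks; each rank accumulates its output.
    outputs = [torch.zeros_like(q) for q in q_chunks]
    v_ring = list(v_chunks)
    for step in range(n):
        for rank in range(n):
            owner = (rank - step) % n
            w = weights[rank][:, owner * chunk_len:(owner + 1) * chunk_len]
            outputs[rank] = outputs[rank] + w @ v_ring[rank]
        v_ring = [v_ring[(r - 1) % n] for r in range(n)]
    return outputs

# Check against ordinary full-sequence self-attention (no masking).
torch.manual_seed(0)
chunks = [torch.randn(2, 16) for _ in range(4)]       # seq_len 8 over 4 "devices"
full = torch.cat(chunks)
reference = torch.softmax(full @ full.T / 16 ** 0.5, dim=-1) @ full
assert torch.allclose(torch.cat(ring_self_attention(chunks, chunks, chunks)),
                      reference, atol=1e-5)
```

In a real distributed run, the list rotation would be replaced by point-to-point sends and receives between neighboring GPUs rather than in-memory copies.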
Experimental Findings
The authors conducted experiments with BERT models to validate the proposed system, with results showing substantial gains in both maximum sequence length and maximum batch size. Noteworthy findings include:
- On a cluster of 64 NVIDIA P100 GPUs, sequence parallelism supports up to 3x longer maximum sequence lengths and over 13x larger maximum batch sizes than traditional tensor parallelism.
- Combined with sparse attention, sequence parallelism handles sequences of over 114K tokens, more than 27 times longer than existing sparse attention methods that keep the whole sequence on a single device.
- Memory efficiency improves significantly under sequence parallelism, supporting scalability in real deployment scenarios, especially in domains that require processing very long sequences, such as medical imaging.
Implications and Future Work
The implications of sequence parallelism are substantial, particularly in democratizing access to training transformers on long sequences without extraordinary hardware resources. Practically, it makes applications that demand long-range sequence processing, such as detailed image analysis or whole-document understanding, more feasible. Theoretically, it challenges the memory limits traditionally assumed for self-attentive architectures and points to new directions for distributed deep learning frameworks.
Future directions discussed by the authors include realizing 4D parallelism by integrating sequence parallelism with data, pipeline, and tensor parallelism. Such an integration could expand training capability along several dimensions at once (model size, batch size, and sequence length), ultimately contributing to the development of extraordinarily large-scale and efficient models.
In summary, this paper presents a robust and scalable approach to overcoming the sequence length limitations of self-attention models, with substantial implications for both the theoretical underpinnings and the practical deployment of transformer-based AI systems.