Reformer: The Efficient Transformer (2001.04451v2)

Published 13 Jan 2020 in cs.LG, cs.CL, and stat.ML

Abstract: Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

Introduction

The paper "Reformer: The Efficient Transformer" addresses the computational and memory inefficiencies of Transformer models, particularly when dealing with long sequences. While Transformers have achieved state-of-the-art results across various tasks in NLP and beyond, the resources required to train such large models have become prohibitive. This paper introduces two significant modifications to the Transformer architecture: locality-sensitive hashing (LSH) attention and reversible residual layers. These changes significantly enhance memory efficiency and computational performance while maintaining comparable accuracy to the standard Transformer model.

Memory and Computational Constraints of Transformers

The conventional Transformer architecture, while powerful, suffers from quadratic time and memory complexity in the length of input sequences due to dot-product attention. Additionally, the residual connections necessitate storing activations for back-propagation, which scales linearly with the number of layers. Consequently, the large-scale models and extended sequences often used in applications exceed the capacity of single accelerators, demanding complex multi-device setups even for tasks like fine-tuning.
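
To make the quadratic cost concrete, the paper's own example is a single sequence of 64K tokens: the attention-score matrix alone, stored in float32, takes roughly 16 GB before any other activations are counted. A quick back-of-the-envelope check (batch size 1 and float32 assumed for illustration):

```python
# Back-of-the-envelope memory cost of one full L x L attention-score matrix
# (illustrative only: batch size 1, single head, float32 activations).

def attention_matrix_bytes(seq_len: int, bytes_per_float: int = 4) -> int:
    """Bytes needed to materialize a seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_float

L = 64 * 1024  # 64K tokens, as in the paper's long-sequence experiments
print(f"Full attention scores for L={L}: "
      f"{attention_matrix_bytes(L) / 2**30:.0f} GiB")  # ~16 GiB
```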

Methodological Innovations

The Reformer model introduces two key techniques to address these limitations:

  1. Locality-Sensitive Hashing (LSH) Attention:
    • Transforms the attention mechanism from O($L^2$) to O($L \log L$) complexity by restricting each query to a small set of likely nearest-neighbor keys (a NumPy sketch of the bucketing follows this list).
    • Implements multi-round LSH attention to reduce the risk of missing important tokens by hashing multiple times and combining the results.
    • Ensures memory-efficient attention by limiting the attention span to the most relevant portions of the sequence.
  2. Reversible Residual Layers:
    • Adopts the RevNet approach to avoid storing multiple copies of activations by making the forward and backward passes reversible.
    • In the reversible Transformer, combining the attention and feed-forward layers into a reversible block allows activations to be recovered during back-propagation without additional memory overhead (see the reversible-block sketch after this list).
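
The bucketing idea behind LSH attention can be sketched in a few lines of NumPy. Queries and keys share one projection (`qk`), each position is hashed with the paper's angular LSH (a random projection followed by an argmax over the projection and its negation), positions are sorted by bucket, and attention is computed only within fixed-size chunks of the sorted sequence. The function names and the single-round, unmasked setup are illustrative simplifications, not the paper's Trax implementation.

```python
import numpy as np

def lsh_hash(x, n_buckets, rng):
    """Angular LSH from the paper: h(x) = argmax([xR; -xR]),
    where R is a random projection of shape (d_model, n_buckets // 2)."""
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    xR = x @ R                                      # (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

def lsh_attention_sketch(qk, v, n_buckets=16, chunk_size=8, seed=0):
    """One hash round of shared-QK attention restricted to sorted chunks.
    Omits multi-round hashing, causal masking, the no-self-attention rule,
    and attending to the previous chunk, all of which the full method uses."""
    rng = np.random.default_rng(seed)
    seq_len, d = qk.shape
    buckets = lsh_hash(qk, n_buckets, rng)          # bucket id per position
    order = np.argsort(buckets, kind="stable")      # sort positions by bucket
    out = np.zeros_like(v)
    for start in range(0, seq_len, chunk_size):     # attend within each chunk
        idx = order[start:start + chunk_size]
        scores = qk[idx] @ qk[idx].T / np.sqrt(d)   # shared-QK dot products
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out
```

In the full model, each chunk also attends to the preceding chunk, and several independent hash rounds are combined so that positions separated into different buckets in one round can still attend to each other in another.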
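
The reversible residual scheme itself fits in a few lines. In the sketch below (a hypothetical `ReversibleBlock` class, not the paper's Trax code), `f` stands in for the attention sublayer and `g` for the feed-forward sublayer; since the inputs can be recomputed exactly from the outputs, intermediate activations do not need to be cached per layer for the backward pass.

```python
import numpy as np

class ReversibleBlock:
    """Reversible residual block as used in the reversible Transformer:
    y1 = x1 + F(x2), y2 = x2 + G(y1), where F is attention and G is the
    feed-forward sublayer. Sketch only: no autograd, single block."""

    def __init__(self, f, g):
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)   # y1 = x1 + Attention(x2)
        y2 = x2 + self.g(y1)   # y2 = x2 + FeedForward(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)   # recover x2 from the outputs alone
        x1 = y1 - self.f(x2)   # then recover x1
        return x1, x2

# Toy check that the inverse reconstructs the inputs exactly, using
# deterministic stand-ins for the attention and feed-forward sublayers.
rng = np.random.default_rng(0)
W_f, W_g = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
block = ReversibleBlock(f=lambda x: np.tanh(x @ W_f),
                        g=lambda x: np.tanh(x @ W_g))
x1, x2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
y1, y2 = block.forward(x1, x2)
r1, r2 = block.inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```

During training, the backward pass walks the stack in reverse, recomputing each layer's inputs with `inverse` instead of reading them from memory, which is what reduces activation storage from N layers' worth to roughly a single layer's worth.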

Experimental Results

The paper evaluates these techniques across multiple tasks, demonstrating that Reformer maintains performance on par with traditional Transformers while substantially improving efficiency.

  1. Synthetic Task:
    • The sequence-duplication task shows that, even with LSH attention, the model reaches near-perfect accuracy once a sufficient number of hash rounds is used, with accuracy improving as rounds are added.
  2. Language Modeling (enwik8):
    • For sequences of length 64K, Reformer matches the canonical Transformer in accuracy but with far superior memory efficiency and speed.
    • A 12-layer Reformer model achieves 1.05 bits/dim on enwik8, demonstrating that the efficiency gains come without sacrificing modeling quality on long sequences.
  3. Image Generation (imagenet-64):
    • On sequences of length 12K, Reformer performs comparably to the best-known models while being faster and more memory-efficient.

Implications and Future Directions

The Reformer model's efficient memory use and reduced computational cost have significant implications for the training and deployment of large-scale models across various domains. By enabling the training of deeper and more complex models on single accelerators, Reformer democratizes access to cutting-edge NLP and generative architectures.

Theoretical Implications:

  • The combination of LSH attention and reversible layers presents a novel approach to reducing the complexity inherent in deep learning models.
  • This dual-focus on both the architecture of attention mechanisms and memory management could inspire further innovations in designing efficient model architectures.

Practical Implications:

  • Reformer could pave the way for more widespread adoption of Transformers in environments with limited computing resources.
  • The ability to efficiently handle long sequences broadens the scope of potential applications, including but not limited to time-series analysis, music generation, and video processing.

Future Developments:

  • Future research could explore extending these efficiency techniques to other architectures and even hybrid models.
  • Investigating the integration of Reformer with pre-trained models like BERT or GPT could potentially enhance their usability without the accompanying resource drain.

Conclusion

The Reformer introduces a compelling approach to addressing the inefficiencies of Transformer models. By leveraging LSH for attention mechanisms and implementing reversible residual layers, this model retains the high performance of traditional Transformers while drastically reducing memory and computational requirements. The innovations detailed in this paper have fundamental implications for the future of model training and deployment, particularly in resource-constrained environments, and open new avenues for efficiently handling long sequences in various applications.

Authors (3)
  1. Nikita Kitaev (10 papers)
  2. Anselm Levskaya (8 papers)
  3. Łukasz Kaiser (17 papers)
Citations (2,106)