Overview of "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention"
This paper addresses a critical computational limitation of Transformer models, specifically their quadratic complexity with respect to sequence length, which makes them prohibitively slow for long sequences. The authors introduce a novel reformulation of the self-attention mechanism to dramatically reduce this complexity.
Key Contributions
- Linearization of Self-Attention:
- The authors propose expressing self-attention as a linear dot-product of kernel feature maps, leveraging the associativity property of matrix products.
- This reformulation reduces the complexity from O(N²) to O(N), where N is the sequence length.
- Iterative Implementation:
- Utilizing the proposed linear formulation, the paper presents an iterative implementation that significantly accelerates autoregressive Transformers. This aligns Transformers more closely with the computational patterns of Recurrent Neural Networks (RNNs).
- Empirical Performance:
- Extensive evaluations on tasks such as image generation and automatic speech recognition demonstrate that the linear Transformers perform comparably to traditional Transformers but are substantially faster. For instance, the proposed models were up to 4000x faster in autoregressive prediction scenarios for very long sequences.
Detailed Analysis
Linear Transformers Formulation
The core idea is to reformulate the self-attention mechanism to achieve linear complexity. Traditional softmax self-attention computes an N × N attention matrix for a sequence of length N, leading to O(N²) time and memory. By reinterpreting the similarity computation as a linear dot-product of kernel feature maps and using the associativity of matrix multiplication, the authors reduce this complexity to O(N).
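As a concrete illustration, below is a minimal NumPy sketch of the non-causal case (function names and shapes are assumptions for illustration, not the paper's code). It uses the feature map φ(x) = elu(x) + 1 adopted in the paper and exploits associativity so the N × N attention matrix is never formed.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, the positive feature map used in the paper
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    # Q, K: (N, d_k), V: (N, d_v). Instead of softmax(Q K^T) V, compute
    # phi(Q) @ (phi(K)^T V), which never materializes the N x N matrix.
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    KV = Kf.T @ V                         # (d_k, d_v), cost O(N * d_k * d_v)
    z = Kf.sum(axis=0)                    # (d_k,) normalization terms
    return (Qf @ KV) / (Qf @ z)[:, None]  # (N, d_v)
```

Because the expensive product is computed as φ(K)ᵀV (a d_k × d_v matrix) before being multiplied by φ(Q), the cost grows linearly with N rather than quadratically.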
Transition to RNN-like Structures
The paper highlights a previously unexplored relationship between Transformers and RNNs. By expressing a causally masked Transformer as an RNN with an iteratively updated internal state, it shows that Transformers can perform autoregressive generation with constant memory and computation per output token, rather than a cost that grows with the length of the context processed so far. This theoretical shift allows for much faster inference, which is crucial for real-time and large-scale applications.
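Concretely, the recurrent view amounts to maintaining, per attention head, a matrix S accumulating φ(k_j)v_jᵀ and a vector z accumulating φ(k_j). The sketch below is an illustrative NumPy version of one generation step (names and shapes are assumptions, not the authors' implementation); each step runs in constant time with respect to the sequence length.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, as in the non-causal sketch above
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_attention_step(q_t, k_t, v_t, S, z):
    # One autoregressive step. S: (d_k, d_v) running sum of phi(k_j) v_j^T,
    # z: (d_k,) running sum of phi(k_j). Together they form the "RNN state".
    qf, kf = elu_feature_map(q_t), elu_feature_map(k_t)
    S = S + np.outer(kf, v_t)      # accumulate memory state
    z = z + kf                     # accumulate normalizer state
    out = (qf @ S) / (qf @ z)      # (d_v,) attention output for this token
    return out, S, z
```

Starting from S and z initialized to zeros and feeding query/key/value vectors one token at a time reproduces causally masked linear attention, with per-step cost independent of how many tokens have already been generated.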
Empirical Validation
- Synthetic Task Performance:
- On synthetic sequence duplication tasks, linear Transformers demonstrated stable convergence and achieved loss values comparable to traditional softmax-based Transformers.
- Image Generation:
- For datasets like MNIST and CIFAR-10, linear Transformers achieved competitive bits-per-dimension while generating images multiple orders of magnitude faster than traditional models. For instance, linear Transformers generated CIFAR-10 images 4452x faster than softmax Transformers without sacrificing quality.
- Automatic Speech Recognition (ASR):
- In ASR tasks on the WSJ dataset, the linear Transformer not only outperformed traditional bi-directional LSTMs and Reformer models in terms of phoneme error rates but also trained faster, indicating practical advantages in both performance and efficiency.
Implications and Future Directions
The reduced complexity and enhanced speed of linear Transformers open new avenues for deploying Transformer models in practical scenarios with long sequences, such as natural language processing, real-time video processing, and high-resolution image generation. Additionally, this work prompts further investigation into the choice of kernel feature maps and their impact on model performance.
Conclusion
This paper provides substantial advancements in the efficiency of Transformer models through the introduction of linear self-attention mechanisms. By demonstrating the feasibility of RNN-like fast autoregressive Transformers, the authors successfully bridge a crucial gap in making these models more practical for large-scale and real-world applications. Future work can explore more sophisticated kernel approximations and further refine the relationship between RNNs and Transformers to unlock even greater efficiencies.