
Random Feature Attention (2103.02143v2)

Published 3 Mar 2021 in cs.CL

Abstract: Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.

Random Feature Attention: A More Efficient Variant for Sequence Models

The paper "Random Feature Attention" introduces a novel attention mechanism for Transformer architectures, named Random Feature Attention (RFA). At its core, RFA uses random feature methods to approximate the softmax function in attention, with the primary aim of improving computational efficiency.

Motivation and Implementation

The traditional softmax attention in Transformers is highly effective but suffers from a computational bottleneck: its time and space complexity are quadratic in the sequence length, which becomes prohibitive for very long sequences. RFA addresses this inefficiency with a linear time and space approximation based on random feature methods. Using random Fourier features, motivated by Bochner's theorem, it builds an unbiased estimate of the Gaussian kernel, so attention can be computed through feature maps rather than an explicit matrix of pairwise scores; this kernel trick is what makes the computation substantially cheaper.
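To make the kernel-trick view concrete, the following is a minimal NumPy sketch of non-causal random-feature attention under the assumptions described above (l2-normalized queries and keys, Gaussian random projections). The function names, the number of random features, and the small epsilon are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def random_fourier_features(x, W):
    """phi(x) such that phi(x) . phi(y) ~= exp(-||x - y||^2 / 2) (Gaussian kernel)."""
    proj = x @ W.T                                            # (..., D)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(W.shape[0])

def rfa_attention(Q, K, V, num_features=128, seed=0):
    """Linear time/space approximation of softmax attention via random features."""
    n, d = Q.shape
    W = np.random.default_rng(seed).standard_normal((num_features, d))  # w_i ~ N(0, I)
    # With unit-norm q and k, exp(q . k) is proportional to the Gaussian kernel,
    # and the constant of proportionality cancels in the normalization below.
    Q = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    K = K / np.linalg.norm(K, axis=-1, keepdims=True)
    phi_Q = random_fourier_features(Q, W)                     # (n, 2D)
    phi_K = random_fourier_features(K, W)                     # (n, 2D)
    S = phi_K.T @ V                                           # (2D, d_v): sum_i phi(k_i) v_i^T
    z = phi_K.sum(axis=0)                                     # (2D,):    sum_i phi(k_i)
    return (phi_Q @ S) / (phi_Q @ z + 1e-6)[:, None]          # (n, d_v)
```

The sums S and z are computed once over all keys and reused for every query, which is where the linear dependence on sequence length comes from.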

RFA can be used as a drop-in replacement for conventional softmax attention. One notable property of RFA is its connection to recurrent neural networks, which motivates a gating mechanism inspired by gated RNNs that lets RFA learn with a recency bias when useful; a sketch of the resulting recurrence follows below. This gating mechanism is particularly effective in tasks where recent inputs matter most, such as language modeling.
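Viewed through this recurrent lens, causal RFA carries a fixed-size state (an outer-product accumulator and a normalizer) from step to step, and the optional gate decays that state to favor recent inputs. Below is a hedged single-step sketch of that recurrence; in practice the gate g would come from a learned sigmoid projection of the current input, and the epsilon and argument names are illustrative assumptions.

```python
import numpy as np

def gated_rfa_step(phi_q, phi_k, v, g, S, z):
    """One decoding step of gated random feature attention (recurrent view).

    phi_q, phi_k: random-feature maps of the current query/key, shape (2D,)
    v:            current value vector, shape (d_v,)
    g:            scalar gate in (0, 1), e.g. sigmoid of a learned projection of x_t
    S, z:         running state from the previous step, shapes (2D, d_v) and (2D,)
    """
    S = g * S + (1.0 - g) * np.outer(phi_k, v)   # decayed sum of phi(k_i) v_i outer products
    z = g * z + (1.0 - g) * phi_k                # decayed sum of phi(k_i)
    out = (phi_q @ S) / (phi_q @ z + 1e-6)       # attention output at this timestep, (d_v,)
    return out, S, z
```

Because the state has constant size no matter how many tokens have been processed, the per-step decoding cost does not grow with the prefix length, which underlies the reported speedups.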

Empirical Evaluation

The paper presents empirical evidence across several established tasks: language modeling, machine translation, and long text classification. On language modeling, gated RFA outperformed the baseline Transformer on WikiText-103, achieving lower perplexity. In machine translation, on WMT14 EN-DE and EN-FR as well as IWSLT14 DE-EN, RFA reached BLEU scores close to the baseline while decoding roughly twice as fast. On long text classification, results on the Long Range Arena benchmark indicate that RFA is well suited to long-sequence processing, maintaining competitive accuracy while reducing computational overhead.

Implications and Future Directions

The introduction of RFA has both theoretical and practical implications. Theoretically, it shows how attention can be computed more efficiently without sacrificing accuracy, which could influence how Transformers are structured and implemented in practice. Practically, RFA's efficiency on long sequences, together with its solid performance at moderate sequence lengths, suggests broad applicability to machine translation systems, real-time natural language processing, and models that must scale with sequence length.

Looking towards the future, RFA offers fertile ground for further exploration, particularly in leveraging pre-trained models. Initial results suggest that knowledge from conventionally pre-trained Transformers can be transferred to RFA through finetuning, which could significantly reduce the cost of training large models from scratch. This adaptation potential could change how researchers and practitioners deploy large-scale models in environments with strict computational constraints.

In conclusion, the paper charts a promising direction for improving the efficiency of Transformers. The findings suggest that, with further research and refinement, random feature-based approaches could become a standard tool for handling attention in deep learning.

Authors (6)
  1. Hao Peng (291 papers)
  2. Nikolaos Pappas (188 papers)
  3. Dani Yogatama (49 papers)
  4. Roy Schwartz (74 papers)
  5. Noah A. Smith (224 papers)
  6. Lingpeng Kong (134 papers)
Citations (322)