- The paper presents an attention-free mechanism that eliminates dot-product self-attention, reducing memory complexity to linear.
- It introduces variants AFT-local and AFT-conv that incorporate localized position biases and convolutional features to enhance efficiency.
- Empirical results demonstrate competitive performance on tasks such as CIFAR10 image modeling, Enwik8 language modeling, and ImageNet classification.
The paper introduces an efficient variant of the transformer architecture, termed the Attention Free Transformer (AFT), which addresses the computational inefficiencies of traditional transformers. By eliminating dot-product self-attention, AFT reduces memory complexity to linear, enabling scalability to large inputs and model sizes.
Methodology and Variants
AFT operates by first combining keys and values with a set of learned position biases, then multiplying the result element-wise with the sigmoid-transformed query. This preserves global interaction between all pairs of sequence elements, resembling the interaction pattern of conventional attention, but it avoids ever computing or storing an explicit attention matrix, which substantially reduces computational overhead.
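The operation above can be sketched directly. The following is a minimal, non-optimised NumPy illustration of the weighting described in the paper (key plus pairwise position bias, exponentiated and normalised, gating by the sigmoid of the query); an efficient implementation would avoid materialising the intermediate T x T x d tensor, for example by processing one target position at a time.

```python
import numpy as np

def aft_full(Q, K, V, w):
    """Sketch of the AFT mechanism.

    Q, K, V: (T, d) query/key/value matrices for a sequence of length T.
    w:       (T, T) learned pairwise position biases.

    For each target position t, values are weighted by
    exp(K_{t'} + w[t, t']) and normalised, then the result is gated
    element-wise by sigmoid(Q_t). No attention matrix is ever formed.
    """
    # exp(K_{t'} + w[t, t']) weights every source position t' per target t
    weights = np.exp(w[:, :, None] + K[None, :, :])   # (T, T, d)
    num = (weights * V[None, :, :]).sum(axis=1)       # (T, d)
    den = weights.sum(axis=1)                         # (T, d)
    gate = 1.0 / (1.0 + np.exp(-Q))                   # sigmoid(Q), (T, d)
    return gate * (num / den)
```

The division by `den` makes the value weighting a proper convex combination per feature dimension, mirroring the softmax normalisation of standard attention without the quadratic-in-d-per-head attention maps.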
The authors propose two notable variants, AFT-local and AFT-conv, which introduce locality and spatial weight sharing. AFT-local restricts the learned position biases to a confined window while still preserving global interaction, improving computational efficiency without sacrificing performance. AFT-conv extends this idea with spatial weight sharing, echoing the structure of convolutional neural networks (CNNs) but with a global receptive field; the shared weights also allow AFT-conv models to handle variable-sized inputs.
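The locality idea in AFT-local can be illustrated as a windowing of the position-bias matrix. This is a hedged sketch (the helper name and the window parameter `s` are illustrative, not taken from the paper's code): the bias is zeroed outside the window rather than the connection being masked, so every position still contributes a term and global connectivity is preserved.

```python
import numpy as np

def local_position_bias(w, s):
    """AFT-local sketch: keep the learned pairwise bias w[t, t'] only
    when |t - t'| < s, and set it to zero otherwise.

    Since a zeroed bias still contributes exp(0) = 1 inside the AFT
    weighting, locality is applied to the *bias*, not to connectivity:
    distant positions remain connected, just without a learned bias.
    """
    T = w.shape[0]
    idx = np.arange(T)
    window = np.abs(idx[:, None] - idx[None, :]) < s  # (T, T) boolean band
    return np.where(window, w, 0.0)
```

The windowed matrix can then be used in place of the full bias in the AFT computation, reducing the number of learned bias parameters that matter from T^2 to roughly 2sT.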
Empirical Evaluation
The authors evaluate AFT on a diverse set of tasks, including CIFAR10 image autoregressive modeling, character-level language modeling on Enwik8, and ImageNet-1K classification. Across these benchmarks, AFT demonstrates competitive performance while improving computational efficiency, and in some cases outperforms traditional transformers.
- Image Autoregressive Modeling (CIFAR10): AFT-local surpasses conventional transformer baselines, achieving state-of-the-art negative log-likelihood (NLL) alongside significant gains in speed and memory efficiency.
- Language Modeling (Enwik8): AFT achieves lower training bits per character than baselines including Reformer, Synthesizer, Linear Transformer, and Performer, and delivers competitive test performance with substantially reduced memory requirements.
- Image Classification (ImageNet-1K): AFT-conv improves classification accuracy over the DeiT baseline and, owing to its convolutional structure, supports variable input sizes. Notably, even with small kernel sizes, AFT-conv matches the accuracy of full transformers, demonstrating that global connectivity can be maintained without explicit attention computation.
Implications and Future Directions
AFT's design marks a shift in how transformers address scalability and computational complexity. By removing explicit attention-matrix computation, AFT offers a significant reduction in memory consumption and computation time, especially for long sequences and high-dimensional data.
The practical implications of AFT span use cases where computational resources are constrained or model responsiveness is critical. The locality and spatial weight sharing introduced in AFT-local and AFT-conv yield clear advantages in computer vision and autoregressive tasks, suggesting directions for further efficiency improvements and architectural refinements.
Future research could explore hybridizing AFT with existing transformer-based approaches to combine the strengths of both paradigms, or evaluate AFT across a broader range of tasks, including those requiring more sophisticated attention patterns. Quantization and sparse modeling techniques could also complement AFT's design, opening additional avenues for resource-efficient applications.
In conclusion, the Attention Free Transformer is a compelling alternative to traditional transformer architectures: its novel handling of attention strikes a strong balance between performance and efficiency, paving the way for further advances in efficient deep learning architectures.