The Inhibitor: ReLU and Addition-Based Attention for Efficient Transformers (2310.02041v1)
Abstract: To enhance the computational efficiency of quantized Transformers, we replace the dot-product and Softmax-based attention with an alternative mechanism involving only addition and ReLU activation. This sidesteps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations, while maintaining much of the core functionality of conventional dot-product attention. It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or alternative arithmetic systems such as homomorphic encryption. Training experiments on four common benchmark tasks show test-set prediction scores comparable to those of conventional Transformers with dot-product attention. Our scaling experiments also suggest significant computational savings, both in plaintext and under encryption. In particular, we believe that the ReLU and addition-based attention mechanism introduced in this paper may enable privacy-preserving AI applications operating under homomorphic encryption by avoiding the costly multiplication of encrypted variables.
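To make the idea concrete, the sketch below is a minimal, hypothetical illustration of an addition and ReLU-based attention step in the spirit of the abstract, not the paper's exact formulation. It assumes a Manhattan-distance score between queries and keys (only subtractions and absolute values), uses that score to inhibit the value vectors through a ReLU, and reconstructs signed outputs from two ReLU branches; the function name `additive_relu_attention`, the `gamma` scaling parameter, and the two-branch sign handling are illustrative assumptions.

```python
import numpy as np


def relu(x):
    # Elementwise ReLU; the only nonlinearity used in this sketch.
    return np.maximum(x, 0.0)


def additive_relu_attention(Q, K, V, gamma=1.0):
    """Hypothetical addition/ReLU-based attention sketch (not the paper's exact method).

    Q, K: (n, d) queries and keys; V: (n, d) values.
    Instead of softmax(Q K^T / sqrt(d)) V, each query-key pair is scored by a
    Manhattan distance (additions and absolute values only), and that score
    acts as an inhibition subtracted from the values before a ReLU. No Softmax
    and no multiplication between activations appears on the attention path.
    """
    # Pairwise Manhattan distances: Z[i, j] = sum_k |Q[i, k] - K[j, k]|, shape (n, n).
    Z = np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)

    # Inhibit each value vector by its (scaled) distance, clip at zero, and
    # accumulate over keys by plain addition. Two branches keep signed values.
    pos = relu(V[None, :, :] - gamma * Z[:, :, None]).sum(axis=1)   # (n, d)
    neg = relu(-V[None, :, :] - gamma * Z[:, :, None]).sum(axis=1)  # (n, d)
    return pos - neg


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
    out = additive_relu_attention(Q, K, V, gamma=0.5)
    print(out.shape)  # (5, 8)
```

In a quantized or homomorphically encrypted setting, the appeal of such a formulation is that the `gamma` scaling can be fixed ahead of time (e.g. folded into a shift), so the per-token attention path reduces to additions, comparisons, and ReLUs rather than wide integer products and Softmax evaluations.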