The Inhibitor: ReLU and Addition-Based Attention for Efficient Transformers (2310.02041v1)

Published 3 Oct 2023 in cs.LG

Abstract: To enhance the computational efficiency of quantized Transformers, we replace the dot-product and Softmax-based attention with an alternative mechanism involving addition and ReLU activation only. This side-steps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations but maintains much of the core functionality of conventional dot-product attention. It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or alternative arithmetic systems like homomorphic encryption. Training experiments on four common benchmark tasks show test set prediction scores comparable to those of conventional Transformers with dot-product attention. Our scaling experiments also suggest significant computational savings, both in plaintext and under encryption. In particular, we believe that the ReLU and addition-based attention mechanism introduced in this paper may enable privacy-preserving AI applications operating under homomorphic encryption by avoiding the costly multiplication of encrypted variables.
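
The abstract does not spell out the exact attention formula, so the following is only a minimal sketch of one plausible addition- and ReLU-based attention variant: query/key similarity from (negative) Manhattan distances, with a ReLU threshold replacing softmax normalization. The function name `additive_relu_attention` and the `gamma` threshold are illustrative assumptions, not the paper's definitive mechanism.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def additive_relu_attention(Q, K, V, gamma=1.0):
    """Hypothetical addition/ReLU attention sketch (assumed, not the paper's
    exact formulation).

    Scores come from pairwise Manhattan distances, which need only additions,
    subtractions, and absolute values; ReLU replaces softmax, so there are no
    exponentials or divisions by a normalizer.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)
    """
    # Pairwise Manhattan distance sum_k |Q[i,k] - K[j,k]| -> shape (n_q, n_k).
    dist = np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)
    # ReLU turns distances into non-negative weights: close query/key pairs
    # get large weights, distant pairs are clipped to zero.
    weights = relu(gamma - dist)
    # Weighted sum of values. Note this last step is still a weight * value
    # product; a fully addition-only scheme would also have to approximate it.
    return weights @ V


# Toy usage
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = additive_relu_attention(Q, K, V, gamma=4.0)
print(out.shape)  # (4, 8)
```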

