Replacing softmax with ReLU in Vision Transformers (2309.08586v2)
Published 15 Sep 2023 in cs.CV and cs.LG
Abstract: Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
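The change described in the abstract admits a compact sketch: the softmax over the scaled dot-product logits is replaced by a point-wise ReLU, and the result is divided by the sequence length L. The snippet below is a minimal single-head NumPy illustration under that reading; the function name, shapes, and the exact placement of the 1/L factor are assumptions for clarity, not the authors' reference implementation.

```python
import numpy as np

def relu_attention(q, k, v):
    """Point-wise ReLU attention with sequence-length scaling.

    Sketch of the idea in the abstract: softmax is replaced by ReLU and the
    attention weights are divided by the sequence length L. Single-head,
    unbatched inputs of shape (L, d) are assumed for simplicity.
    """
    L, d = q.shape
    logits = q @ k.T / np.sqrt(d)           # (L, L) scaled dot-product logits
    weights = np.maximum(logits, 0.0) / L   # ReLU in place of softmax, scaled by 1/L (assumed form)
    return weights @ v                       # (L, d) attended values

# Toy usage on random single-head inputs (L=4 tokens, d=8 head dim).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = relu_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because the ReLU weights need not sum to one, the 1/L factor keeps their expected magnitude comparable to softmax as the sequence grows, which is the mitigation the abstract refers to.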