MABViT -- Modified Attention Block Enhances Vision Transformers (2312.01324v2)
Abstract: Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly LLMs. It has also been shown that running the MLP and attention blocks within each Transformer block in parallel, rather than in the conventional serialized order, accelerates LLM training without significantly impacting performance. However, when we ran the MLP and attention blocks in parallel for image classification, we observed a noticeable decline in performance. To tackle this problem, we propose a novel transformer variant that integrates non-linearity within the attention block. Specifically, we apply a GLU-based activation function to the Value tensor; this technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on ImageNet-1K while using fewer parameters, and it outperforms the B/16 variant with only half the parameters. We additionally report results with a GELU activation function variant to support our claims. Finally, we show that the MABViT variants exhibit greater potential than the standard architecture when used in deep transformers.
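To make the core idea concrete, below is a minimal PyTorch sketch of a self-attention block in which the Value tensor passes through a GLU-style gate before the attention weights are applied. This is an illustrative sketch under stated assumptions: the class name GatedValueAttention, the SwiGLU-style (SiLU) gate, and the projection names are hypothetical choices for exposition, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedValueAttention(nn.Module):
    """Multi-head self-attention with a GLU-style non-linearity on the Value tensor.

    Sketch only: the SwiGLU-style gate and projection layout are assumptions,
    not the paper's exact formulation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        # Two Value projections: one carries the signal, one acts as the gate.
        self.v_proj = nn.Linear(dim, dim)
        self.v_gate = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q = self.q_proj(x)
        k = self.k_proj(x)
        # GLU-based activation applied to the Value tensor (SwiGLU-style gating assumed).
        v = self.v_proj(x) * F.silu(self.v_gate(x))

        # Split into heads: (batch, heads, tokens, head_dim).
        def split(t):
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)


# Quick shape check on random token embeddings (196 patches + CLS token, S/16-sized width).
tokens = torch.randn(2, 197, 384)
block = GatedValueAttention(dim=384, num_heads=6)
print(block(tokens).shape)  # torch.Size([2, 197, 384])
```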
- Mahesh Ramesh
- Aswinkumar Ramkumar