SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization (2405.11582v2)
Abstract: Transformers have become foundational architectures for both natural language and computer vision tasks. However, their high computational cost makes them challenging to deploy on resource-constrained devices. This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computationally friendly because it computes statistics during inference. However, replacing LayerNorm with the more efficient BatchNorm in transformers often leads to inferior performance or even training collapse. To address this problem, we propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training. Moreover, we propose a simplified linear attention (SLA) module that is simple yet effective in achieving strong performance. Extensive experiments on image classification and object detection demonstrate the effectiveness of our proposed method. For example, our SLAB-Swin obtains $83.6\%$ top-1 accuracy on ImageNet-1K with $16.2$ms latency, which is $2.4$ms less than that of Flatten-Swin while achieving $0.1\%$ higher accuracy. We also evaluated our method on the language modeling task and obtained comparable performance with lower latency. Code is publicly available at https://github.com/xinghaochen/SLAB and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB.
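The abstract describes two ingredients: progressively swapping LayerNorm for a re-parameterized BatchNorm during training (PRepBN), and a simplified linear attention (SLA). Below is a minimal PyTorch-style sketch of how such a progressive replacement and a generic kernelized linear attention could look. All class/function names, the linear decay schedule, the `BN(x) + eta * x` re-parameterized form, and the ReLU feature map are assumptions for illustration only, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class RepBN(nn.Module):
    """BatchNorm with a learnable identity branch: BN(x) + eta * x.

    Sketch of a 're-parameterized BatchNorm'; the extra branch could be folded
    into BN's affine parameters at inference (assumption based on the abstract).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.eta = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); BatchNorm1d expects (batch, dim, tokens)
        y = self.bn(x.transpose(1, 2)).transpose(1, 2)
        return y + self.eta * x


class ProgressiveNorm(nn.Module):
    """Blend LayerNorm and RepBN: y = lam * LN(x) + (1 - lam) * RepBN(x).

    lam decays from 1 to 0 over training (linear schedule assumed), so training
    starts from the stable LayerNorm path and ends on the BatchNorm-only path,
    which is cheaper at inference.
    """

    def __init__(self, dim: int, total_steps: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.repbn = RepBN(dim)
        self.total_steps = total_steps
        self.register_buffer("step", torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lam = torch.clamp(1.0 - self.step / self.total_steps, 0.0, 1.0)
        if self.training:
            self.step += 1
        return lam * self.ln(x) + (1.0 - lam) * self.repbn(x)


def linear_attention(q, k, v, eps=1e-6):
    """Generic kernelized linear attention with a ReLU feature map.

    Runs in O(N * d^2) rather than O(N^2 * d); the paper's simplified linear
    attention may use a different kernel or normalization (assumption).
    q, k: (batch, tokens, d); v: (batch, tokens, d_v).
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                      # (batch, d, d_v)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)


if __name__ == "__main__":
    norm = ProgressiveNorm(dim=96, total_steps=1000)
    tokens = torch.randn(8, 49, 96)                  # (batch, tokens, channels)
    print(norm(tokens).shape)                        # torch.Size([8, 49, 96])
    q = k = v = torch.randn(8, 49, 32)
    print(linear_attention(q, k, v).shape)           # torch.Size([8, 49, 32])
```

The key design point the sketch tries to capture is that the expensive, inference-time statistic computation of LayerNorm is only needed early in training for stability; once lam reaches zero, only the BatchNorm path (mergeable into preceding linear layers) remains.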
Authors: Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang