SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization (2405.11582v2)

Published 19 May 2024 in cs.CV and cs.CL

Abstract: Transformers have become foundational architectures for both natural language and computer vision tasks. However, their high computational cost makes them challenging to deploy on resource-constrained devices. This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computationally friendly due to statistics calculation during inference. However, replacing LayerNorm with the more efficient BatchNorm in transformers often leads to inferior performance and training collapse. To address this problem, we propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training. Moreover, we propose a simplified linear attention (SLA) module that is simple yet effective in achieving strong performance. Extensive experiments on image classification as well as object detection demonstrate the effectiveness of our proposed method. For example, our SLAB-Swin obtains $83.6\%$ top-1 accuracy on ImageNet-1K with $16.2$ms latency, which is $2.4$ms less than that of Flatten-Swin with $0.1\%$ higher accuracy. We also evaluated our method on the language modeling task and obtained comparable performance with lower latency. Code is publicly available at https://github.com/xinghaochen/SLAB and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB.

Authors (4)
  1. Jialong Guo (6 papers)
  2. Xinghao Chen (66 papers)
  3. Yehui Tang (63 papers)
  4. Yunhe Wang (145 papers)
Citations (5)

Summary

Understanding SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

Introduction

Transformers have been game changers in both NLP and computer vision. However, their significant computational demands make them difficult to deploy on resource-constrained devices. The paper we're looking into today tackles this issue by targeting the two computational bottlenecks inside the transformer block: the normalization layers and the attention modules.

Key Innovations

1. Progressive Re-parameterized BatchNorm (PRepBN)

Why it Matters:

Layer Normalization (LayerNorm) is standard in transformers, but it is not computationally friendly because it must compute normalization statistics on the fly at inference time. BatchNorm is cheaper at inference, since its statistics are fixed after training and the whole operation can be folded into an adjacent linear layer, yet naively substituting it into transformers usually hurts accuracy or destabilizes training. PRepBN is designed to address both limitations.
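The reason BatchNorm is essentially free at inference is that its statistics and affine parameters are constants after training, so the normalization can be merged into a neighboring linear projection. The snippet below is a minimal sketch of this standard BatchNorm folding algebra (generic, not code from the paper); the function name and argument layout are illustrative.

```python
import torch

def fold_bn_into_linear(W, b, gamma, beta, mean, var, eps=1e-5):
    """Merge y = BN(Wx + b) into a single affine map y = W_f x + b_f.

    W: (out, in) linear weight, b: (out,) linear bias.
    gamma, beta, mean, var: (out,) BatchNorm affine parameters and running stats.
    """
    scale = gamma / torch.sqrt(var + eps)   # per-channel rescaling applied by BN
    W_f = W * scale.unsqueeze(1)            # scale each output row of the weight
    b_f = (b - mean) * scale + beta         # fold the shift into the bias
    return W_f, b_f
```

LayerNorm offers no such shortcut, because its mean and variance depend on each individual input and must be recomputed at inference time.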

How it Works:

  • Progressive Strategy: The method gradually transitions from LayerNorm to BatchNorm during training. Early in training the model relies on LayerNorm, which keeps optimization stable; the weighting then shifts toward BatchNorm, which is faster at inference.
  • Re-parameterized BatchNorm: To further stabilize training, PRepBN introduces an extra learnable parameter that modulates the BatchNorm output; once training is done, this term can be re-parameterized away so that inference runs as a plain BatchNorm. A minimal sketch of both ideas follows this list.
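To make this concrete, here is a minimal PyTorch-style sketch. It assumes the blended normalization takes the form $\lambda \cdot \mathrm{LN}(x) + (1-\lambda) \cdot \mathrm{RepBN}(x)$ with $\mathrm{RepBN}(x) = \mathrm{BN}(x) + \eta x$ ($\eta$ learnable) and $\lambda$ decayed from 1 to 0 over training; the class names, decay schedule, and tensor layout are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class RepBN(nn.Module):
    """BatchNorm plus a learnable scaled skip connection (eta * x).

    After training, the eta * x term can be folded into the BatchNorm's
    affine parameters, so inference runs as a plain BatchNorm."""
    def __init__(self, dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.eta = nn.Parameter(torch.zeros(1))

    def forward(self, x):            # x: (batch, tokens, dim)
        x = x.transpose(1, 2)        # BatchNorm1d expects (batch, dim, tokens)
        x = self.bn(x) + self.eta * x
        return x.transpose(1, 2)

class PRepBN(nn.Module):
    """Progressively blends LayerNorm into RepBN over `total_steps` of training."""
    def __init__(self, dim, total_steps):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.repbn = RepBN(dim)
        self.total_steps = total_steps
        self.register_buffer("step", torch.zeros(1))

    def forward(self, x):
        # lam decays from 1 (pure LayerNorm) to 0 (pure RepBN)
        lam = max(0.0, 1.0 - self.step.item() / self.total_steps)
        if self.training:
            self.step += 1
        if lam == 0.0:
            # once the schedule finishes, only the re-parameterizable BN path remains
            return self.repbn(x)
        return lam * self.ln(x) + (1.0 - lam) * self.repbn(x)
```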

Results Highlight:

PRepBN proves effective for both image classification and object detection. For instance, SLAB-Swin achieves 83.6% top-1 accuracy on ImageNet-1K with a latency of 16.2 ms, which is 2.4 ms faster than Flatten-Swin while being 0.1% more accurate.

2. Simplified Linear Attention (SLA)

Why it Matters:

The traditional attention mechanism in transformers is computationally expensive because its cost grows quadratically with the number of tokens. Linear attention aims to make this computation more efficient.
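To see where the savings come from (this is the standard linear-attention argument, not something specific to this paper): softmax attention computes $\mathrm{softmax}(QK^\top/\sqrt{d})\,V$, which materializes an $N \times N$ score matrix and costs $O(N^2 d)$ for $N$ tokens of head dimension $d$. If the softmax is replaced by a kernel feature map $\phi$, the product can be re-associated as $\phi(Q)\bigl(\phi(K)^\top V\bigr)$, so only $d \times d$ intermediates appear and the cost drops to $O(N d^2)$, linear in the sequence length.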

How it Works:

  • Simplification: SLA uses ReLU as the kernel function and adds a depth-wise convolution for local feature enhancement, making it simpler than prior linear-attention designs while remaining effective.
  • Decoupling: By computing the key-value product first and multiplying by the queries afterwards, SLA avoids forming the full token-by-token attention matrix, reducing complexity from quadratic to linear in the number of tokens while maintaining performance (see the sketch after this list).
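The sketch below shows one way such a module can look in PyTorch: ReLU is applied to queries and keys as the kernel function, the key-value product is formed first, and a depth-wise convolution over the value tokens (reshaped onto their 2D grid) adds back local detail. The exact normalization, where the convolution branch attaches, and all names are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedLinearAttention(nn.Module):
    """Linear attention with a ReLU kernel plus a depth-wise convolution branch."""
    def __init__(self, dim, num_heads, spatial_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # depth-wise conv restores local information that linear attention tends to lose
        self.dwc = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.spatial_size = spatial_size   # (H, W) of the token grid, N = H * W

    def forward(self, x):                  # x: (B, N, C)
        B, N, C = x.shape
        H, W = self.spatial_size
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (B, heads, N, head_dim)

        q, k = F.relu(q), F.relu(k)               # ReLU kernel feature map
        kv = k.transpose(-2, -1) @ v              # (head_dim x head_dim): O(N * d^2)
        z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)
        out = (q @ kv) * z                        # normalized linear attention

        out = out.transpose(1, 2).reshape(B, N, C)
        # local enhancement: depth-wise conv over the values on their 2D grid
        v_img = v.transpose(1, 2).reshape(B, N, C).transpose(1, 2).reshape(B, C, H, W)
        out = out + self.dwc(v_img).flatten(2).transpose(1, 2)
        return self.proj(out)
```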

Results Highlight:

Across the reported benchmarks, SLAB transformers equipped with SLA achieve significant latency reductions while maintaining accuracy comparable to existing efficient models.

Broader Implications

Theoretical Implications

  • Normalization Strategy: The success of PRepBN could pave the way for more advanced normalization techniques that offer a balance between computational efficiency and model stability.
  • Attention Mechanisms: SLA's effectiveness suggests that research into simpler, linear attention mechanisms could be a fertile ground for further innovations.

Practical Implications

  • Scalability: These optimizations mean that powerful transformers can be deployed on less powerful hardware.
  • Efficiency Gains: Industries can use these techniques to reduce operational costs related to computational resources, making sophisticated models more accessible.

Future Directions

Several intriguing avenues suggest themselves for future research and practical implementation:

  1. Adaptation to Various Domains: While the paper mainly focuses on vision and LLMs, future research could adapt these techniques to other fields such as reinforcement learning or time-series analysis.
  2. Hybrid Models: Combining PRepBN and SLA with other efficiency techniques could yield even more scalable transformers.
  3. Further Optimization: The progressive transition schedule and the linear attention mechanism could both be tuned further to achieve better performance.

Conclusion

The improvements proposed in this paper, namely the Progressive Re-parameterized BatchNorm and Simplified Linear Attention, demonstrate tangible advancements in making transformers more efficient without sacrificing performance. By carefully rethinking normalization layers and attention modules, the researchers have paved the way for more accessible and scalable transformer architectures. The discussed techniques are not just incremental improvements but foundational steps that make efficient transformers feasible in real-world, resource-constrained environments. This is indeed an exciting step forward for the deployment of advanced AI models in broader applications.