MLP Can Be A Good Transformer Learner (2404.05657v1)

Published 8 Apr 2024 in cs.CV

Abstract: The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands. Previous token pruning works motivate their methods from the perspective of computational redundancy, yet they still need to load the full network and incur the same memory cost. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers, guided by entropy considerations. We find that, for the attention layers in the bottom blocks, their subsequent MLP layers (i.e., two feed-forward layers) can elicit the same entropy quantity. Meanwhile, these accompanying MLPs are under-exploited, since they exhibit smaller feature entropy than the MLPs in the top blocks. Therefore, we propose to integrate the uninformative attention layers into their subsequent MLPs by degenerating them into identity mappings, yielding blocks that contain only an MLP. Experimental results on ImageNet-1k show that the proposed method can remove 40% of the attention layers of DeiT-B, improving throughput and reducing the memory bound without compromising performance. Code is available at https://github.com/sihaoevery/lambda_vit.
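The core operation described in the abstract, degenerating an attention layer into an identity mapping so that only the LayerNorm + MLP residual branch of the block remains, can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch rather than the authors' released implementation (see the linked repository for that); the attribute names `blocks`, `norm2`, and `mlp` follow the common timm/DeiT block layout and are assumptions.

```python
# Minimal sketch: replace the attention branch of a ViT/DeiT block with an
# identity mapping, leaving only the LayerNorm + MLP residual branch.
# Attribute names (.blocks, .norm2, .mlp) assume the timm/DeiT block layout.
import torch
import torch.nn as nn


class MLPOnlyBlock(nn.Module):
    """A transformer block whose attention sub-layer has been degenerated
    into an identity mapping, so only the MLP residual branch remains."""

    def __init__(self, block: nn.Module):
        super().__init__()
        # Keep the second residual branch (LayerNorm + MLP) unchanged.
        self.norm2 = block.norm2
        self.mlp = block.mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention branch reduced to identity: the block output is just
        # the input plus the MLP residual.
        return x + self.mlp(self.norm2(x))


def remove_bottom_attention(vit: nn.Module, num_blocks: int) -> nn.Module:
    """Replace the first `num_blocks` blocks of a DeiT/ViT-style model
    (assumed to expose an indexable `.blocks` container) with MLP-only blocks."""
    for i in range(num_blocks):
        vit.blocks[i] = MLPOnlyBlock(vit.blocks[i])
    return vit


# Usage sketch (assumes the timm package and a DeiT-B checkpoint):
# import timm
# model = timm.create_model("deit_base_patch16_224", pretrained=True)
# model = remove_bottom_attention(model, num_blocks=5)  # ~40% of 12 blocks
```

In the paper the choice of which attention layers to drop is guided by the measured feature entropy of the bottom blocks; the sketch above only shows the structural replacement once those layers have been identified.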

Authors (7)
  1. Sihao Lin (9 papers)
  2. Pumeng Lyu (4 papers)
  3. Dongrui Liu (34 papers)
  4. Tao Tang (87 papers)
  5. Xiaodan Liang (318 papers)
  6. Andy Song (14 papers)
  7. Xiaojun Chang (148 papers)
Citations (7)