Lightweight Vision Transformer with Cross Feature Attention (2207.07268v2)

Published 15 Jul 2022 in cs.CV

Abstract: Recent advances in vision transformers (ViTs) have achieved great performance in visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations, but these networks are spatially local. ViTs can learn global representations with their self-attention mechanism, but they are usually heavyweight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to bring down the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel lightweight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone to learn both global and local representations. Experimental results show that XFormer outperforms numerous CNN- and ViT-based models across different tasks and datasets. On the ImageNet-1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based) with a similar number of parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 and an FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks.
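The abstract's claim that XFA "brings down computation cost" points at attention whose cost is linear rather than quadratic in the token count. Below is a minimal PyTorch sketch of one way to get that: cross-covariance-style attention computed over the feature (channel) dimension, so the attention map is head_dim x head_dim instead of N x N. This is an illustration under that assumption, not the authors' released implementation; the class name, the learnable per-head temperature, and the projection layout are all illustrative.

```python
# Hedged sketch of a feature-dimension ("cross feature") attention block.
# Assumption: XFA follows the cross-covariance idea (attention over channels,
# linear in token count N); details here are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFeatureAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature scales the channel attention map.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)  # each: (B, heads, head_dim, N)
        # L2-normalize along the token axis so the channel attention is well scaled.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # (head_dim x head_dim) attention map: cost is linear in N, not quadratic.
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

# Quick shape check: 196 tokens (14x14 patches), 64-dim embedding.
x = torch.randn(2, 196, 64)
print(CrossFeatureAttention(dim=64)(x).shape)  # torch.Size([2, 196, 64])
```

For N tokens and embedding dimension C, the attention map above costs roughly O(N * C^2 / h) per forward pass instead of the O(N^2 * C) of standard token-to-token self-attention, which is what makes feature-dimension attention attractive at the resolutions mobile backbones run at.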

Authors (5)
  1. Youpeng Zhao (16 papers)
  2. Huadong Tang (3 papers)
  3. Yingying Jiang (10 papers)
  4. Yong A (8 papers)
  5. Qiang Wu (154 papers)
Citations (9)