SepViT: Separable Vision Transformer (2203.15380v4)

Published 29 Mar 2022 in cs.CV

Abstract: Vision Transformers have achieved widespread success across a range of vision tasks. However, these Transformers often incur heavy computational costs to reach high performance, which makes them burdensome to deploy on resource-constrained devices. To alleviate this issue, we draw lessons from depthwise separable convolution and follow its design philosophy to build an efficient Transformer backbone, the Separable Vision Transformer, abbreviated as SepViT. SepViT carries out local-global information interaction within and among windows in sequential order via a depthwise separable self-attention. A novel window token embedding and grouped self-attention are employed to compute the attention relationship among windows at negligible cost and to establish long-range visual interactions across multiple windows, respectively. Extensive experiments on general-purpose vision benchmarks demonstrate that SepViT achieves a state-of-the-art trade-off between performance and latency. Notably, SepViT reaches 84.2% top-1 accuracy on ImageNet-1K classification while reducing latency by 40% compared with models of similar accuracy (e.g., CSWin). Furthermore, SepViT achieves 51.0% mIoU on ADE20K semantic segmentation, 47.9 AP on RetinaNet-based COCO detection, and 49.4 box AP and 44.6 mask AP on Mask R-CNN-based COCO object detection and instance segmentation.
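To make the abstract's "depthwise separable self-attention" concrete, here is a minimal sketch of the idea: attention is first computed inside each local window (the depthwise-like step), then a learned window token summarizes each window and attention among these window tokens exchanges information across windows (the pointwise-like step). All class names, shapes, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' implementation, and the grouped self-attention variant is omitted.

```python
import torch
import torch.nn as nn


class SeparableWindowAttention(nn.Module):
    """Illustrative window-wise attention followed by window-token attention."""

    def __init__(self, dim: int, num_heads: int = 4, window_size: int = 7):
        super().__init__()
        self.window_size = window_size
        # Attention inside each window (local, depthwise-like step).
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Attention among window tokens (global, pointwise-like step).
        self.window_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned embedding prepended to every window as its window token.
        self.window_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map, H and W divisible by window_size.
        B, H, W, C = x.shape
        ws = self.window_size
        nh, nw = H // ws, W // ws

        # Partition into non-overlapping windows: (B * nh * nw, ws * ws, C).
        windows = (
            x.view(B, nh, ws, nw, ws, C)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(B * nh * nw, ws * ws, C)
        )

        # Prepend a window token to every window, then attend within the window.
        tok = self.window_token.expand(windows.size(0), -1, -1)
        seq = torch.cat([tok, windows], dim=1)
        seq, _ = self.local_attn(seq, seq, seq)
        win_tok, windows = seq[:, :1], seq[:, 1:]

        # Global step: window tokens from all windows attend to one another,
        # then each updated token is broadcast back into its own window
        # (a stand-in for the paper's window-level interaction).
        win_tok = win_tok.reshape(B, nh * nw, C)
        win_tok, _ = self.window_attn(win_tok, win_tok, win_tok)
        windows = windows + win_tok.reshape(B * nh * nw, 1, C)

        # Reverse the window partition back to (B, H, W, C).
        return (
            windows.view(B, nh, nw, ws, ws, C)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(B, H, W, C)
        )


if __name__ == "__main__":
    attn = SeparableWindowAttention(dim=64, num_heads=4, window_size=7)
    x = torch.randn(2, 28, 28, 64)   # 28 = 4 windows of size 7 per side
    print(attn(x).shape)             # torch.Size([2, 28, 28, 64])
```

The separation mirrors depthwise separable convolution: the cost of full global attention is replaced by cheap per-window attention plus attention over one token per window.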

References (52)
  1. End-to-end object detection with transformers. In European Conference on Computer Vision, 213–229.
  2. GLiT: Neural architecture search for global and local image transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12–21.
  3. RegionViT: Regional-to-local attention for vision transformers. arXiv preprint arXiv:2106.02689.
  4. AutoFormer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12270–12280.
  5. Mobile-Former: Bridging MobileNet and Transformer. arXiv preprint arXiv:2108.05895.
  6. Twins: Revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840.
  7. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882.
  8. UP-DETR: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1601–1610.
  9. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. arXiv preprint arXiv:2107.00652.
  10. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  11. LeViT: A vision transformer in ConvNet's clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12259–12269.
  12. CMT: Convolutional neural networks meet vision transformers. arXiv preprint arXiv:2107.06263.
  13. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1580–1589.
  14. Transformer in transformer. Advances in Neural Information Processing Systems, 34.
  15. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2961–2969.
  16. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  17. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1314–1324.
  18. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  19. Deep networks with stochastic depth. In European Conference on Computer Vision, 646–661.
  20. Shuffle Transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650.
  21. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6399–6408.
  22. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
  23. BossNAS: Exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12281–12291.
  24. Neural architecture search with a lightweight transformer for text-to-image synthesis. IEEE Transactions on Network Science and Engineering, 1567–1576.
  25. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988.
  26. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755.
  27. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
  28. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545.
  29. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  30. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision, 116–131.
  31. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178.
  32. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211–252.
  33. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
  34. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626.
  35. The evolved transformer. In International Conference on Machine Learning, 5877–5886.
  36. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 6105–6114.
  37. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357.
  38. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5463–5474.
  39. PVT v2: Improved baselines with pyramid vision transformer. arXiv preprint arXiv:2106.13797.
  40. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122.
  41. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 22–31.
  42. Unified perceptual parsing for scene understanding. In European Conference on Computer Vision, 418–434.
  43. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34.
  44. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500.
  45. Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9981–9990.
  46. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10819–10829.
  47. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 558–567.
  48. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6848–6856.
  49. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6881–6890.
  50. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 633–641.
  51. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
  52. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Authors (8)
  1. Wei Li (1121 papers)
  2. Xing Wang (191 papers)
  3. Xin Xia (171 papers)
  4. Jie Wu (230 papers)
  5. Jiashi Li (22 papers)
  6. Xuefeng Xiao (51 papers)
  7. Min Zheng (32 papers)
  8. Shiping Wen (17 papers)
Citations (35)