Vision Transformer with Super Token Sampling (2211.11167v2)

Published 21 Nov 2022 in cs.CV

Abstract: Vision transformer has achieved impressive performance for many vision tasks. However, it may suffer from high redundancy in capturing local features for shallow layers. Local self-attention or early-stage convolutions are thus utilized, which sacrifice the capacity to capture long-range dependency. A challenge then arises: can we access efficient and effective global context modeling at the early stages of a neural network? To address this issue, we draw inspiration from the design of superpixels, which reduces the number of image primitives in subsequent processing, and introduce super tokens into vision transformer. Super tokens attempt to provide a semantically meaningful tessellation of visual content, thus reducing the token number in self-attention as well as preserving global modeling. Specifically, we propose a simple yet strong super token attention (STA) mechanism with three steps: the first samples super tokens from visual tokens via sparse association learning, the second performs self-attention on super tokens, and the last maps them back to the original token space. STA decomposes vanilla global attention into multiplications of a sparse association map and a low-dimensional attention, leading to high efficiency in capturing global dependencies. Based on STA, we develop a hierarchical vision transformer. Extensive experiments demonstrate its strong performance on various vision tasks. In particular, without any extra training data or label, it achieves 86.4% top-1 accuracy on ImageNet-1K with less than 100M parameters. It also achieves 53.9 box AP and 46.8 mask AP on the COCO detection task, and 51.9 mIOU on the ADE20K semantic segmentation task. Code is released at https://github.com/hhb072/STViT.

Vision Transformer with Super Token Sampling: An Analytical Examination

The paper "Vision Transformer with Super Token Sampling" presents an approach to improving both the computational efficiency and the global context modeling of Vision Transformers (ViTs). The authors introduce a mechanism termed Super Token Attention (STA), which combines the superpixel idea from classical image processing with the transformer attention framework to address the redundancy ViTs exhibit when capturing local features in shallow layers.

Motivation and Approach

The central challenge with Vision Transformers is the quadratic complexity of self-attention, which becomes prohibitive for high-resolution visual tasks. This cost is compounded by redundancy in the early layers, where attention is dominated by nearby tokens and largely captures local features. The authors propose super tokens, a form of spatial reduction inspired by superpixels, which aims to tessellate the visual content into semantically meaningful units, reducing the number of tokens entering self-attention without forfeiting global context.
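As a rough accounting of the savings (our own back-of-the-envelope estimate rather than figures from the paper, assuming $N$ visual tokens, $m$ super tokens with $m \ll N$, and channel dimension $C$), the super-token decomposition replaces the quadratic cost of global attention with two much smaller terms:

$$\mathcal{O}(N^2 C) \;\longrightarrow\; \underbrace{\mathcal{O}(N m C)}_{\text{association and up-sampling}} \;+\; \underbrace{\mathcal{O}(m^2 C)}_{\text{attention among super tokens}}$$

The sparse association map used in the paper reduces the first term further, since each token is associated with only a handful of surrounding super tokens.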

Super Token Attention Mechanism

The core innovation of this work is the Super Token Attention (STA), a three-step process involving:

  1. Super Token Sampling (STS): Visual tokens are aggregated into fewer super tokens via sparse association learning. This step reduces redundancy and lowers computational demands.
  2. Self-Attention: The reduced set of super tokens undergoes self-attention, enabling the model to capture long-range dependencies more efficiently.
  3. Token Upsampling: The attention-refined super tokens are mapped back to the original token space, so that subsequent layers consume a full-resolution token map enriched with global context.

STA cleverly decomposes the conventional global attention mechanism into sparse, low-dimensional multiplicative operations, significantly enhancing efficiency.
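To make the decomposition concrete, the following PyTorch-style sketch walks through the three steps on a single feature map. It is a minimal, dense illustration under our own simplifying assumptions: it omits the query/key/value projections, multi-head attention, the sparse local association, and the iterative refinement used in the released implementation, and the function name and tensor shapes are illustrative rather than taken from the official code.

```python
import torch
import torch.nn.functional as F

def super_token_attention(x, grid_size):
    """Dense sketch of Super Token Attention (STA).

    x: (B, N, C) visual tokens from an H x W grid, with N = H * W.
    grid_size: (h, w) layout of super tokens, m = h * w and m << N.
    Returns a (B, N, C) tensor of tokens refined with global context.
    """
    B, N, C = x.shape
    h, w = grid_size
    m = h * w

    # 1) Super Token Sampling: initialize super tokens by average-pooling the
    #    tokens inside each grid cell, then compute a soft association map Q
    #    between every token and every super token and aggregate accordingly.
    H = W = int(N ** 0.5)                          # assumes a square token grid
    x_2d = x.transpose(1, 2).reshape(B, C, H, W)
    s = F.adaptive_avg_pool2d(x_2d, (h, w))        # (B, C, h, w) initial super tokens
    s = s.reshape(B, C, m).transpose(1, 2)         # (B, m, C)

    q = torch.softmax(x @ s.transpose(1, 2) / C ** 0.5, dim=-1)       # (B, N, m)
    s = (q.transpose(1, 2) @ x) / (q.sum(dim=1).unsqueeze(-1) + 1e-6)  # weighted mean

    # 2) Self-attention among the m super tokens: the quadratic cost is paid
    #    on m tokens instead of N, which is where the efficiency comes from.
    attn = torch.softmax(s @ s.transpose(1, 2) / C ** 0.5, dim=-1)     # (B, m, m)
    s = attn @ s

    # 3) Token Upsampling: project the refined super tokens back onto the
    #    original tokens through the same association map.
    return q @ s                                   # (B, N, C)
```

In this form the expensive (N x N) attention map never materializes; only the (N x m) association map and the (m x m) super-token attention do, which is the multiplication of an association map with a low-dimensional attention that the abstract describes (rendered densely here for readability).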

Empirical Results

Through extensive empirical validation, the paper demonstrates the efficacy of STA within hierarchical Vision Transformers (STViT) across multiple vision tasks:

  • Image Classification: STViT reaches 86.4% top-1 accuracy on ImageNet-1K with fewer than 100M parameters and no extra training data or labels, matching or exceeding contemporaneous models at notably lower FLOPs.
  • Object Detection and Instance Segmentation: Used as a backbone on the COCO dataset, STViT attains up to 53.9 box AP and 46.8 mask AP, surpassing previous benchmarks.
  • Semantic Segmentation: The method reports a mean Intersection over Union (mIoU) score of 51.9 on ADE20K, verifying its effectiveness in capturing spatial semantics with reduced computational overhead.

Implications and Future Prospects

The introduction of Super Tokens is a compelling augmentation to the standard transformer paradigm, offering a pathway to improved efficiency without sacrificing modeling capacity. This has meaningful implications for deploying transformers in resource-constrained environments or real-time applications.

The work also suggests directions for further development, such as improving the robustness of STA across image scales or integrating it with other efficient transformer architectures. Future work might likewise explore applications beyond traditional vision tasks that could benefit from the efficiency gains of this approach.

In conclusion, this paper contributes substantively to the discourse on transformer efficiency, providing a practical mechanism to balance the demands of computational complexity with the necessity of capturing nuanced global contexts in visual data. The Super Token Attention strategy exemplifies the progressive trajectory of neural architecture research towards more scalable and adaptable frameworks.

Authors (5)
  1. Huaibo Huang (58 papers)
  2. Xiaoqiang Zhou (11 papers)
  3. Jie Cao (79 papers)
  4. Ran He (172 papers)
  5. Tieniu Tan (119 papers)
Citations (42)

GitHub

  1. GitHub - hhb072/STViT (142 stars)