Learning Correlation Structures for Vision Transformers (2404.03924v1)
Abstract: We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively captures rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as the main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
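To make the mechanism concrete, below is a minimal single-head PyTorch sketch of the idea the abstract describes: each query's correlation map with all keys is viewed as a spatial map, a convolution detects local structure in that map, and the resulting attention aggregates value features. This is an illustrative assumption-laden simplification, not the authors' implementation; the class name `StructSASketch`, the single 3x3 kernel, and the square-grid token layout are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class StructSASketch(nn.Module):
    """Minimal single-head sketch of structural self-attention (StructSA).

    Hypothetical simplification: query-key correlation maps are treated
    as spatial structures, convolved to recognize local correlation
    patterns, and the resulting attention aggregates value features.
    Kernel size, normalization, and shapes are illustrative assumptions,
    not the paper's exact design.
    """

    def __init__(self, dim: int, grid: int, kernel: int = 3):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.grid = grid  # tokens are assumed to form a grid x grid layout
        # Convolution over each query's correlation map, viewed as a 2D
        # image, to recognize spatial structures in key-query correlations.
        self.struct_conv = nn.Conv2d(1, 1, kernel, padding=kernel // 2)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) with N == grid * grid
        B, N, _ = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        corr = (q @ k.transpose(-2, -1)) * self.scale      # (B, N, N)
        # Reshape each query's correlations with all keys into a 2D map
        # and convolve it, so attention reflects local correlation structure.
        maps = corr.reshape(B * N, 1, self.grid, self.grid)
        attn = self.struct_conv(maps).reshape(B, N, N).softmax(dim=-1)
        return attn @ v                                     # (B, N, dim)


# Usage: 49 tokens arranged on a 7x7 spatial grid.
x = torch.randn(2, 49, 64)
out = StructSASketch(dim=64, grid=7)(x)
print(out.shape)  # torch.Size([2, 49, 64])
```

For video, the same idea would extend the correlation maps and convolution from 2D space to 3D space-time, which is how the abstract's "space-time structures" such as object motion could be recognized; that extension is omitted here to keep the sketch short.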