Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window (2306.13776v1)
Abstract: Transformer models have shown great potential in computer vision, following their success in language tasks. Swin Transformer is one such model: it outperforms convolution-based architectures in accuracy while improving efficiency over Vision Transformer (ViT) and its variants, whose complexity is quadratic in the input size. Swin Transformer features shifted windows that allow cross-window connections while limiting self-attention computation to non-overlapping local windows. However, shifting windows introduces memory copy operations, which account for a significant portion of its runtime. To mitigate this issue, we propose Swin-Free, which applies size-varying windows across stages, instead of shifting windows, to achieve cross-connections among local windows. With this simple design change, Swin-Free runs faster than Swin Transformer at inference and with better accuracy. Furthermore, we also propose a few Swin-Free variants that are faster than their Swin Transformer counterparts.
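The design change is easiest to see in code. Below is a minimal PyTorch sketch, based only on the abstract, contrasting the two approaches: Swin's shifted-window blocks roll the feature map before partitioning it into windows (a memory copy), while Swin-Free drops the shift and instead varies the window size from stage to stage. The function names and the specific window sizes (7, 14) are illustrative assumptions, not the authors' implementation.

```python
# Sketch contrasting Swin's shifted windows with Swin-Free's
# size-varying windows. Illustrative only; names and sizes are assumed.
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows of
    shape (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def swin_partition(x: torch.Tensor, window_size: int, shift: int) -> torch.Tensor:
    """Swin: every other block shifts the feature map before partitioning.
    torch.roll materializes a copy, which the paper identifies as a
    significant share of inference time."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))  # memory copy
    return window_partition(x, window_size)

def swin_free_partition(x: torch.Tensor, stage_window_size: int) -> torch.Tensor:
    """Swin-Free: no shift; cross-window connection instead comes from
    using a different window size at each stage, so window boundaries
    do not line up from one stage to the next."""
    return window_partition(x, stage_window_size)

x = torch.randn(1, 56, 56, 96)                            # stage-1 feature map
shifted = swin_partition(x, window_size=7, shift=3)       # Swin-style block
unshifted = swin_free_partition(x, stage_window_size=14)  # hypothetical stage size
print(shifted.shape, unshifted.shape)  # (64, 7, 7, 96) and (16, 14, 14, 96)
```

Because the roll is gone, each Swin-Free block is a plain partition-attend-merge, which is also friendlier to inference engines such as TensorRT that must otherwise fuse or execute the shift as a separate copy kernel.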
Authors: Jinkyu Koo, John Yang, Le An, Gwenaelle Cunha Sergio, Su Inn Park