Efficient Attention: Attention with Linear Complexities (1812.01243v10)
Abstract: Dot-product attention has wide applications in computer vision and natural language processing. However, its memory and computational costs grow quadratically with the input size. Such growth prohibits its application to high-resolution inputs. To remedy this drawback, this paper proposes a novel efficient attention mechanism that is equivalent to dot-product attention but has substantially lower memory and computational costs. Its resource efficiency allows more widespread and flexible integration of attention modules into a network, which leads to better accuracies. Empirical evaluations demonstrated these advantages: efficient attention modules brought significant performance boosts to object detectors and instance segmenters on MS-COCO 2017. Further, the resource efficiency democratizes attention to complex models, where high costs prohibit the use of dot-product attention. As an exemplar, a model with efficient attention achieved state-of-the-art accuracy for stereo depth estimation on the Scene Flow dataset. Code is available at https://github.com/cmsflash/efficient-attention.
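To make the linear-complexity claim concrete, the sketch below contrasts standard dot-product attention with the efficient form in its softmax-normalization variant: queries are normalized along the key dimension, keys along the position dimension, and V is first aggregated into a small set of global context vectors, avoiding the n x n similarity matrix. This is an illustrative minimal sketch under assumed shapes, not the code from the authors' repository; the function names are placeholders.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(q, k, v):
    # Standard attention: materializes an n x n similarity matrix,
    # so memory and compute grow quadratically with sequence length n.
    sim = q @ k.transpose(-2, -1)        # (n, n)
    return F.softmax(sim, dim=-1) @ v    # (n, d_v)

def efficient_attention(q, k, v):
    # Softmax-normalization variant: normalize Q over the key dimension
    # and K over the position dimension, then aggregate V into d_k
    # "global context" vectors before attending. Cost is O(n * d_k * d_v),
    # i.e. linear in n.
    q = F.softmax(q, dim=-1)                 # softmax over d_k, per position
    k = F.softmax(k, dim=-2)                 # softmax over n, per channel
    context = k.transpose(-2, -1) @ v        # (d_k, d_v)
    return q @ context                       # (n, d_v)

# Illustrative usage with assumed sizes.
n, d_k, d_v = 1024, 64, 64
q, k, v = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)
out = efficient_attention(q, k, v)
print(out.shape)  # torch.Size([1024, 64])
```

The key design point is the reassociation of the matrix product: computing K^T V first yields a small (d_k, d_v) matrix, so no pairwise position-to-position map is ever formed.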
Authors: Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, Hongsheng Li