ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers (2208.13138v1)

Published 28 Aug 2022 in cs.CV

Abstract: Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse attention method as an alternative to dense self-attention, aiming to reduce computational complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal but can be processed at a lower computational cost. We further extend the clustering-guided attention from single-scale to multi-scale, which benefits dense prediction tasks. We label the proposed Transformer architecture ClusTR and demonstrate that it achieves state-of-the-art performance on various vision tasks at lower computational cost and with fewer parameters. For instance, our ClusTR small model with 22.7M parameters achieves 83.2% Top-1 accuracy on ImageNet. Source code and ImageNet models will be made publicly available.
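The core idea in the abstract, clustering key and value tokens so that queries attend to a small set of cluster representatives rather than all N tokens, can be illustrated with a minimal sketch. This is an assumption-laden toy version (plain k-means over keys, mean-pooled values, single head, NumPy), not the paper's actual implementation; the function names and cluster count are hypothetical.

```python
import numpy as np

def cluster_kv(K, V, num_clusters, iters=10, seed=0):
    """Toy k-means over key tokens; values are mean-pooled with the same
    assignment. The paper's actual clustering procedure may differ."""
    N, d = K.shape
    rng = np.random.default_rng(seed)
    centers = K[rng.choice(N, num_clusters, replace=False)].copy()
    for _ in range(iters):
        # Assign each key token to its nearest cluster center.
        dists = ((K[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(num_clusters):
            mask = assign == c
            if mask.any():
                centers[c] = K[mask].mean(axis=0)
    # Aggregate values per cluster (empty clusters contribute zeros).
    V_c = np.zeros((num_clusters, V.shape[1]))
    for c in range(num_clusters):
        mask = assign == c
        if mask.any():
            V_c[c] = V[mask].mean(axis=0)
    return centers, V_c

def clustered_attention(Q, K, V, num_clusters=16):
    """Queries attend to C clustered key/value tokens instead of all N,
    so the score matrix is N x C rather than N x N."""
    K_c, V_c = cluster_kv(K, V, num_clusters)
    d = Q.shape[-1]
    scores = Q @ K_c.T / np.sqrt(d)
    # Numerically stable softmax over the cluster axis.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V_c  # shape (N, d); cost O(N * C) vs O(N^2)
```

Because attention scores are computed against C cluster representatives instead of N tokens, the quadratic term in sequence length is replaced by N·C, which is the complexity reduction the abstract claims while the clustered tokens still summarize the whole sequence's content.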

Authors (5)
  1. Yutong Xie (68 papers)
  2. Jianpeng Zhang (35 papers)
  3. Yong Xia (141 papers)
  4. Anton van den Hengel (188 papers)
  5. Qi Wu (323 papers)
Citations (4)