Vision Transformers with Hierarchical Attention
The paper "Vision Transformers with Hierarchical Attention" introduces a novel approach to the challenge of computational and space complexity inherent in vision transformers, particularly due to the Multi-Head Self-Attention (MHSA) mechanism. Traditional vision transformers such as ViT present remarkable capabilities in modeling global dependencies using MHSA, but their practicality in vision tasks is often limited by these resource constraints. The authors propose a hierarchical approach to MHSA, termed Hierarchical MHSA (H-MHSA), which reduces this computational burden while maintaining the ability to model both local and global token relationships effectively.
Methodology and Contributions
The core contribution of the paper is H-MHSA, which decomposes the attention computation into a hierarchical process. The input image is first divided into patches, each treated as a token. Attention is then computed hierarchically: local relationships are modeled within small grids of patches, after which the grids are merged into coarser tokens and attention is computed over these merged tokens to capture global dependencies. The key insight is that each step operates on only a limited number of tokens, greatly reducing the computational cost compared to standard MHSA.
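To make the two-step computation concrete, the following is a minimal sketch of this hierarchical scheme in PyTorch. It is not the authors' implementation: the module name, the use of average pooling to merge grids, and the additive fusion of local and global features are illustrative assumptions.

```python
# Hedged sketch of hierarchical attention: local attention within grids of
# patches, followed by attention over pooled (merged) tokens. Details are
# illustrative, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttention(nn.Module):
    def __init__(self, dim, num_heads=4, grid_size=7):
        super().__init__()
        self.grid_size = grid_size
        # Standard multi-head attention reused for both the local and global steps.
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) token grid; H and W assumed divisible by grid_size.
        B, H, W, C = x.shape
        g = self.grid_size

        # Step 1: local attention inside each g x g grid of patch tokens.
        local = x.view(B, H // g, g, W // g, g, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B * (H // g) * (W // g), g * g, C)
        local, _ = self.local_attn(local, local, local)
        local = local.view(B, H // g, W // g, g, g, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)

        # Step 2: merge each grid into one token (average pooling here) and
        # attend over the much smaller set of merged tokens for global context.
        merged = F.avg_pool2d(local.permute(0, 3, 1, 2), kernel_size=g)  # (B, C, H/g, W/g)
        merged = merged.flatten(2).transpose(1, 2)                       # (B, HW/g^2, C)
        global_out, _ = self.global_attn(merged, merged, merged)

        # Broadcast the coarse global features back to full resolution and
        # fuse them with the local features (additive fusion assumed here).
        global_out = global_out.transpose(1, 2).view(B, C, H // g, W // g)
        global_out = F.interpolate(global_out, scale_factor=g, mode="nearest")
        return local + global_out.permute(0, 2, 3, 1)

# Example: HierarchicalAttention(64)(torch.randn(2, 14, 14, 64)).shape -> (2, 14, 14, 64)
```

In this sketch, each local attention operates over only g² tokens and the global attention over (H·W)/g² merged tokens, so neither step pays the full quadratic cost of attending over all H·W tokens at once, which is the efficiency argument the paper makes for H-MHSA.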
The authors build on this mechanism to construct the Hierarchical-Attention-based Transformer Networks (HAT-Net). HAT-Net benefits from H-MHSA's ability to model global and local dependencies simultaneously, which is crucial for comprehensive scene understanding. This dual modeling is posited to be more effective than the strategies of previous architectures such as the Swin Transformer, which primarily emphasizes local dependencies within windows, or PVT, which focuses on global dependencies at the cost of fine-grained local detail.
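As a rough illustration of how such an attention module could be stacked into a backbone, here is a generic pre-norm transformer block wrapping the sketch above. The block structure, layer names, and hyperparameters are assumptions for illustration rather than HAT-Net's exact design.

```python
# Generic transformer block using the HierarchicalAttention sketch above.
import torch.nn as nn

class HATBlock(nn.Module):
    def __init__(self, dim, num_heads=4, grid_size=7, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = HierarchicalAttention(dim, num_heads, grid_size)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # x: (B, H, W, C); residual connections around attention and MLP.
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```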
Experimental Results
HAT-Net is validated extensively across a suite of vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. For image classification on ImageNet, HAT-Net outperforms contemporary architectures with comparable computational budgets, with accuracy gains reported across all four variants (Tiny, Small, Medium, and Large).
For semantic segmentation, experiments on the ADE20K dataset show that HAT-Net consistently outperforms both convolutional and transformer-based networks, underlining its efficacy in dense prediction tasks. Similarly, for object detection and instance segmentation on MS-COCO, HAT-Net achieves significant gains in both bounding-box and mask metrics, demonstrating the strength of its learned feature representations.
Implications and Future Work
The introduction of H-MHSA is an important step toward more efficient vision transformers, combining the strengths of local and global modeling. By offering a flexible and computationally efficient attention mechanism, the paper opens pathways for broader application of transformers in real-world scenarios, particularly where computational resources are constrained.
In terms of future work, the authors suggest that the H-MHSA approach could inspire new design paradigms for vision transformers, potentially influencing architecture decisions across various scales and facilitating the development of more efficient models for both small-scale and large-scale applications.
The research sets a precedent for future transformer architectures, emphasizing hierarchical attention mechanisms that balance comprehensive scene modeling against computational constraints. Through careful experimentation and deliberate design choices, the paper contributes to the ongoing evolution of efficient and effective vision transformers.