The paper "Scalable Vision Transformers with Hierarchical Pooling" presents a methodological advance in the domain of Visual Transformers, proposing a more efficient design that leverages hierarchical pooling. Traditionally, Vision Transformers (ViT) have been utilized without pooling, maintaining a consistent sequence length throughout the depth of the network. This work explores the integration of pooling mechanisms to mitigate computational complexity issues and enhance model scalability.
Hierarchical Pooling and Its Impact
The central innovation of the paper is the Hierarchical Visual Transformer (HVT), which employs hierarchical pooling to progressively shorten the token sequence as the network deepens, mirroring the feature-map downsampling of Convolutional Neural Networks (CNNs). The key benefit is that the computation saved by operating on fewer tokens can be reinvested in model dimensions such as depth, width, and resolution without increasing the overall budget. The experiments show substantial gains in computational efficiency and demonstrate that HVT outperforms comparable baselines on benchmarks such as ImageNet and CIFAR-100 at a similar number of floating-point operations (FLOPs).
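To make the efficiency argument concrete, the following back-of-the-envelope estimate (standard ViT-style accounting with an MLP expansion ratio of 4, not taken from the paper) shows why shortening the sequence pays off: the attention term grows quadratically in the number of tokens n, while the projection and MLP terms grow only linearly.

```latex
% Approximate FLOPs of one Transformer encoder layer with n tokens and width d:
%   Q/K/V/output projections: 4 n d^2,  attention (QK^T and AV): 2 n^2 d,  MLP: 8 n d^2
\mathrm{FLOPs}(n, d) \;\approx\; \underbrace{12\, n d^{2}}_{\text{linear in } n} \;+\; \underbrace{2\, n^{2} d}_{\text{quadratic in } n}
```

Under this estimate, halving n between stages cuts the quadratic attention term by roughly 4x and the linear terms by 2x, which is the budget HVT reinvests into a deeper, wider, or higher-resolution model.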
Technical Contributions and Claims
The authors highlight several technical contributions:
- Hierarchical Pooling: The paper introduces a pooling strategy that systematically shrinks the sequence length across layers, yielding a pyramidal feature hierarchy within the Transformer architecture. This structural change is shown to improve both efficiency and discriminative capacity.
- Token Utilization: HVT predicts the class from average-pooled visual tokens rather than from the traditional class token, which yields better performance; empirically, the average-pooled tokens exhibit richer discriminative patterns (a minimal sketch of both ideas follows this list).
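To ground these two contributions, here is a minimal PyTorch sketch of the overall pattern: encoder stages interleaved with pooling along the token dimension, followed by classification from the mean of the remaining tokens. It is a hypothetical illustration of the idea, not the authors' implementation; the layer sizes, the stride-2 max pooling, and the three-stage layout are all assumptions.

```python
import torch
import torch.nn as nn

class PooledStage(nn.Module):
    """A stack of standard encoder layers followed by 1D max pooling over the
    token dimension, which halves the sequence length (illustrative choice)."""
    def __init__(self, dim, depth, num_heads, pool_stride=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.pool = nn.MaxPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, x):                                     # x: (batch, tokens, dim)
        x = self.blocks(x)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # pool along the token axis

class HVTLikeClassifier(nn.Module):
    """Hierarchical stages that shrink the token sequence, then a linear head
    applied to the average of the remaining tokens instead of a [CLS] token."""
    def __init__(self, dim=192, num_classes=1000, stage_depths=(2, 2, 2), num_heads=3):
        super().__init__()
        self.stages = nn.ModuleList(
            PooledStage(dim, depth, num_heads) for depth in stage_depths)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                                # tokens: patch embeddings (batch, n, dim)
        for stage in self.stages:
            tokens = stage(tokens)                            # sequence length drops each stage
        return self.head(tokens.mean(dim=1))                  # average-pooled tokens -> logits

# Example: 196 patch tokens (a 14x14 grid) shrink to 98, 49, and finally 24 tokens.
logits = HVTLikeClassifier()(torch.randn(2, 196, 192))
print(logits.shape)  # torch.Size([2, 1000])
```

Pooling along the token axis plays the same role as strided convolutions in a CNN backbone: each stage operates on a shorter, coarser sequence, which is where the FLOPs savings in the estimate above come from.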
Experimental Evaluation
The evaluation is thorough, comparing HVT against established baselines such as DeiT and PoWER-BERT on ImageNet and CIFAR-100. The results are notably strong: when the model dimensions are scaled up while keeping FLOPs comparable, the proposed method achieves a 3.03% Top-1 accuracy gain over the baseline models on ImageNet.
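As an illustration of what "computational comparability" means in practice, the per-layer estimate from earlier can be used to check that a pooled token schedule leaves room to widen the model at a similar budget. The token counts, widths, and pooling schedule below are illustrative assumptions, not the configurations reported in the paper:

```python
def layer_flops(n, d):
    """Rough FLOPs of one encoder layer: Q/K/V/output projections + attention + MLP
    (expansion 4); patch embedding and the classifier head are ignored."""
    return 12 * n * d**2 + 2 * n**2 * d

def model_flops(token_schedule, d):
    """token_schedule lists the sequence length seen by each layer."""
    return sum(layer_flops(n, d) for n in token_schedule)

# 12-layer unpooled baseline at 196 tokens and width 192 (DeiT-Ti-like sizes, assumed).
baseline = model_flops([196] * 12, d=192)

# Pooled variant: halve the token count after every 4 layers; the savings allow a
# wider model (d=256 is an arbitrary illustrative choice) at a similar budget.
pooled = model_flops([196] * 4 + [98] * 4 + [49] * 4, d=256)

print(f"baseline ~{baseline / 1e9:.2f} GFLOPs, pooled and wider ~{pooled / 1e9:.2f} GFLOPs")
# baseline ~1.22 GFLOPs, pooled and wider ~1.18 GFLOPs
```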
Practical and Theoretical Implications
The research holds significant implications for the design and deployment of large-scale vision models. Practically, the reduced computational costs facilitate the use of Transformer architectures on resource-constrained devices, broadening their applicability. Theoretically, the insight into sequence pooling in Transformers may inspire further studies into balancing efficiency with model capacity, potentially influencing future architectural designs across different computational domains in AI.
Future Directions
The scalability of HVT across model dimensions leaves several avenues for future exploration. Since the paper targets only the encoder design, extending the hierarchical pooling mechanism to decoder-style architectures could broaden its applicability to dense prediction tasks such as object detection and segmentation. Additionally, investigating principled scaling strategies could identify configurations with better accuracy-efficiency trade-offs.
Conclusion
The paper contributes a novel approach to improving both the computational efficiency and the scalability of Vision Transformers via hierarchical pooling. This innovation addresses a key limitation of current ViT models, the constant token sequence length maintained at every layer, and offers a viable path toward scalable vision systems that do not compromise on speed or performance. The insights from this work may shape the future trajectory of transformer-based architectures in computer vision.