Green Hierarchical Vision Transformer for Masked Image Modeling (2205.13515v2)

Published 26 May 2022 in cs.CV and cs.LG

Abstract: We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. Third, as for the convolution layers, we convert them to the Sparse Convolution that works seamlessly with the sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs, e.g., Swin Transformer and Twins Transformer, about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.

Authors (6)
  1. Lang Huang (16 papers)
  2. Shan You (46 papers)
  3. Mingkai Zheng (19 papers)
  4. Fei Wang (574 papers)
  5. Chen Qian (226 papers)
  6. Toshihiko Yamasaki (74 papers)
Citations (59)

Summary

Efficient Masked Image Modeling with Hierarchical Vision Transformers

The paper presents an approach for improving the efficiency of Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs). Unlike methods that feed every image patch (masked and visible) through the encoder, this strategy operates only on the visible patches, substantially reducing computation and memory. The method introduces several key designs aimed at applying MIM to hierarchical architectures, such as the Swin and Twins Transformers, while preserving performance.
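To make the efficiency gain concrete, the following minimal sketch (PyTorch; the function name and shapes are illustrative, not taken from the released GreenMIM code) shows the visible-patch selection step common to MAE-style MIM: a fixed fraction of patch tokens is dropped before the encoder, so the encoder's cost scales with the visible fraction only.

```python
# Illustrative sketch, not the authors' implementation: random masking keeps
# only a subset of patch tokens, and the encoder runs on those tokens alone.
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, N, D) patch embeddings. Returns visible patches and their indices."""
    B, N, D = patches.shape
    num_visible = int(N * (1 - mask_ratio))
    # Random permutation per sample; keep the first `num_visible` indices.
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_visible = ids_shuffle[:, :num_visible]
    visible = torch.gather(patches, 1, ids_visible.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_visible  # the encoder sees `visible` only

# Example: with mask_ratio=0.75, a 14x14=196-token image keeps only 49 tokens.
x = torch.randn(2, 196, 768)
vis, ids = random_masking(x)
print(vis.shape)  # torch.Size([2, 49, 768])
```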

Central to this approach are three proposed mechanisms. First, the Group Window Attention scheme addresses the quadratic complexity of self-attention by gathering the visible patches of local windows, which may be of arbitrary size, into equally sized groups and performing masked self-attention within each group. Second, a Dynamic Programming (DP) algorithm selects the grouping that minimizes the overall attention computation. Third, the standard convolutions in hierarchical ViTs are replaced with Sparse Convolution layers that operate seamlessly on the sparse set of visible patches.
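The grouping idea can be illustrated with a short sketch. The code below uses hypothetical helper names; the paper finds the partition with a dynamic-programming solver, whereas a simple greedy first-fit heuristic stands in here purely for illustration. Visible tokens from several local windows are packed into groups bounded by a capacity, and self-attention is masked so tokens attend only within their own window.

```python
# Simplified sketch of Group Window Attention (not the released GreenMIM code).
import torch
import torch.nn.functional as F

def greedy_grouping(window_sizes, group_capacity):
    """Assign each window (given its count of visible tokens) to a group whose
    total token count stays within `group_capacity` (greedy stand-in for the paper's DP)."""
    groups, loads = [], []
    for w, size in sorted(enumerate(window_sizes), key=lambda t: -t[1]):
        for g, load in enumerate(loads):
            if load + size <= group_capacity:
                groups[g].append(w)
                loads[g] += size
                break
        else:
            groups.append([w])
            loads.append(size)
    return groups

def masked_group_attention(tokens, window_ids):
    """tokens: (G, D) visible tokens of one group; window_ids: (G,) source window of each token.
    Self-attention is restricted to tokens coming from the same window."""
    attn_mask = window_ids[:, None] != window_ids[None, :]          # True = block attention
    scores = tokens @ tokens.T / tokens.shape[-1] ** 0.5
    scores = scores.masked_fill(attn_mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ tokens

# Example: four windows with uneven numbers of visible tokens.
print(greedy_grouping([5, 3, 7, 1], group_capacity=8))  # [[2, 3], [0, 1]]

toks = torch.randn(6, 32)
win = torch.tensor([0, 0, 1, 1, 1, 2])     # which window each visible token came from
out = masked_group_attention(toks, win)    # (6, 32), no cross-window attention
```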

The primary motivation for this research is the success of Masked Language Modeling in NLP, which inspired the adaptation of a similar strategy in computer vision, termed MIM. The representative method, Masked Autoencoder (MAE), though effective, is limited to isotropic (plain) ViT architectures. This paper aims to overcome this limitation by integrating MIM with hierarchical models. Hierarchical structures, which are prevalent in current vision models for handling variations in visual scale, can now support efficient MIM through these designs.

The experimental results demonstrate the effectiveness of this method with several key findings. The proposed approach trains hierarchical ViTs about 2.7 times faster while reducing GPU memory usage by 70%. Models trained with this method maintain competitive performance on ImageNet classification and surpass conventional baselines on COCO object detection. Specifically, the paper reports top-1 fine-tuning accuracy of 83.9% with Twins-L and 85.1% with Swin-L on ImageNet.

A notable aspect of this research is its alignment with principles of "Green AI," emphasizing efficiency and reduced environmental impact. By significantly lowering computation demands, this approach not only democratizes access to state-of-the-art self-supervised training techniques but also underscores the importance of sustainable AI research.

This work matters for both the practical deployment of vision models and research on efficient model training: the reduced computational burden makes advanced self-supervised pre-training feasible with significantly fewer resources, broadening access to high-performance models.

Future directions may include expanding this methodology to other areas where sparse data representations are applicable. Furthermore, exploring the integration of this “green” methodology with other emerging architectures not directly addressed in this paper could extend its utility.

In summary, this paper's contributions mark a significant advancement in the efficient training of hierarchical ViTs, offering a practical and sustainable method for MIM that promises to influence both current practices and future research directions in artificial intelligence.