Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality (2205.10063v1)

Published 20 May 2022 in cs.CV

Abstract: Masked AutoEncoder (MAE) has recently led the trends of visual self-supervision area by an elegant asymmetric encoder-decoder design, which significantly optimizes both the pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of Vanilla Vision Transformer (ViT), whose self-attention mechanism reasons over arbitrary subset of discrete image patches. However, it is still unclear how the advanced Pyramid-based ViTs (e.g., PVT, Swin) can be adopted in MAE pre-training as they commonly introduce operators within "local" windows, making it difficult to handle the random sequence of partial vision tokens. In this paper, we propose Uniform Masking (UM), successfully enabling MAE pre-training for Pyramid-based ViTs with locality (termed "UM-MAE" for short). Specifically, UM includes a Uniform Sampling (US) that strictly samples $1$ random patch from each $2 \times 2$ grid, and a Secondary Masking (SM) which randomly masks a portion of (usually $25\%$) the already sampled regions as learnable tokens. US preserves equivalent elements across multiple non-overlapped local windows, resulting in the smooth support for popular Pyramid-based ViTs; whilst SM is designed for better transferable visual representations since US reduces the difficulty of pixel recovery pre-task that hinders the semantic learning. We demonstrate that UM-MAE significantly improves the pre-training efficiency (e.g., it speeds up and reduces the GPU memory by $\sim 2\times$) of Pyramid-based ViTs, but maintains the competitive fine-tuning performance across downstream tasks. For example using HTC++ detector, the pre-trained Swin-Large backbone self-supervised under UM-MAE only in ImageNet-1K can even outperform the one supervised in ImageNet-22K. The codes are available at https://github.com/implus/UM-MAE.

An Essay on "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality"

The paper "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality" addresses a critical challenge in the advancement of Masked AutoEncoder (MAE) techniques within self-supervised learning paradigms for computer vision tasks. The crux of the paper lies in adapting MAE, a self-supervised strategy highly effective with standard Vision Transformers (ViTs), for pyramid-based vision transformers which rely on "local" operations and windowed self-attention mechanisms. The authors propose a novel Uniform Masking (UM) strategy that comprises two primary components: Uniform Sampling (US) and Secondary Masking (SM), enabling efficient MAE pre-training for such pyramid-based architectures (e.g., PVT, Swin).

In the first substantive segment of the paper, the authors examine why MAE does not transfer directly to pyramid-based ViTs. The classical MAE formulation relies on the "global" property of the standard ViT architecture, whose self-attention mechanism can reason over an arbitrary subset of discrete image patches. Pyramid-based ViTs, by contrast, apply their operators within local windows and therefore cannot directly handle a random sequence of partial vision tokens without substantial adaptation. Uniform Sampling bridges this gap: by strictly sampling one random patch from each 2 × 2 grid cell, it keeps the number of visible elements equal across non-overlapping local windows, marrying MAE-style efficiency with a locality-oriented design.
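
To make the sampling step concrete, the sketch below builds a boolean keep-mask that retains exactly one randomly chosen patch from every non-overlapping 2 × 2 block of the patch grid, so 25% of patches stay visible. The function name and tensor layout are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def uniform_sampling_mask(h, w, device=None):
    """Keep exactly one random patch in every non-overlapping 2x2 block
    of an (h x w) patch grid, i.e. 25% of patches stay visible.
    Illustrative sketch only; h and w are assumed to be even."""
    assert h % 2 == 0 and w % 2 == 0
    # For each 2x2 block, choose which of its 4 positions stays visible.
    choice = torch.randint(0, 4, (h // 2, w // 2), device=device)
    keep = torch.zeros(h // 2, w // 2, 2, 2, dtype=torch.bool, device=device)
    flat = keep.view(-1, 4)  # one row per 2x2 block
    flat[torch.arange(choice.numel(), device=device), choice.view(-1)] = True
    # (h/2, w/2, 2, 2) -> (h/2, 2, w/2, 2) -> (h, w) patch-grid layout
    return keep.permute(0, 2, 1, 3).reshape(h, w)
```

Because every 2 × 2 block contributes exactly one visible patch, the kept patches can be gathered into a dense (h/2 × w/2) grid, which the window-based operators of PVT or Swin can process like an ordinary feature map.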

A pivotal feature of the Uniform Masking approach is how it avoids the uneven distribution of visible elements across local windows that fully random masking would produce. The US protocol guarantees an equal number of visible elements per local window, while Secondary Masking re-masks a portion (typically 25%) of the already sampled regions as learnable tokens, restoring the difficulty of the pixel-recovery pretext task that uniform sampling would otherwise make too easy. Together, these components not only make masked sampling compatible with pyramid-based architectures but also yield more robust and transferable representations for downstream tasks.
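
The secondary masking step can be sketched in a similar way: among the tokens kept by uniform sampling, a random fraction (typically 25%) is replaced in place by a shared learnable mask token, so the compact grid layout survives while the reconstruction task becomes harder. The helper below is an assumed illustration under that reading, not the paper's exact code.

```python
import torch

def apply_secondary_masking(tokens, mask_token, sm_ratio=0.25):
    """Replace a random `sm_ratio` fraction of the already-sampled tokens
    (tokens: [B, N, C]) with a shared learnable mask token ([1, 1, C]).
    Only token contents change; the spatial layout is preserved."""
    b, n, _ = tokens.shape
    num_mask = int(n * sm_ratio)
    # Per sample, pick which of the N visible tokens get secondarily masked.
    noise = torch.rand(b, n, device=tokens.device)
    ids = noise.argsort(dim=1)[:, :num_mask]  # [B, num_mask]
    sm = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
    sm.scatter_(1, ids, True)
    return torch.where(sm.unsqueeze(-1), mask_token.to(tokens.dtype), tokens)

# mask_token would typically be a learnable parameter, e.g.
# mask_token = torch.nn.Parameter(torch.zeros(1, 1, embed_dim))  # embed_dim is hypothetical
```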

The practical implications of this work are underscored by the computational gains UM-MAE delivers: it roughly doubles pre-training speed and halves GPU memory for pyramid-based ViTs. Downstream benchmarks reinforce the point; with the HTC++ detector, a Swin-Large backbone pre-trained self-supervised with UM-MAE on ImageNet-1K alone can even outperform its counterpart supervised on the much larger ImageNet-22K.

Moreover, through rigorous experimentation, the paper identifies suitable settings for the Secondary Masking ratio and the fine-tuning protocol. These insights are valuable for applying the UM-MAE framework efficiently in other contexts, underscoring not only its effectiveness but also its generalizability.

The paper concludes with an insightful discussion of the contrasting behaviors of vanilla and pyramid-based ViTs under the Masked Image Modeling (MIM) framework. The results suggest that vanilla ViTs benefit particularly from MIM, since their architecture lacks an intrinsic locality bias and mask-based pre-training helps compensate for that absence.

Looking ahead, the implications of this work extend to computer vision applications where resource efficiency is paramount and self-supervised pre-training can provide substantial benefits. The UM-MAE strategy could see broad adoption for optimizing existing architectures and for exploring new directions in hierarchical vision modeling, with further applications to complex tasks such as semantic segmentation and object detection.

In conclusion, the paper offers a detailed examination of MAE pre-training for pyramid-based vision transformers, proposes a practical solution to this non-trivial problem, backs it with extensive experiments, and provides guidelines for implementing UM-MAE effectively.

Authors (4)
  1. Xiang Li (1002 papers)
  2. Wenhai Wang (123 papers)
  3. Lingfeng Yang (12 papers)
  4. Jian Yang (503 papers)
Citations (66)