- The paper presents SparK, a method that adapts BERT-style masked modeling to convnets, using sparse convolution to handle the irregular inputs that masking creates.
- It employs a hierarchical decoder that leverages multi-scale features inherent to convnets, enhancing reconstruction and model efficiency.
- Empirical results on ImageNet and COCO demonstrate significant performance gains over existing methods, highlighting improved feature learning and transferability.
Analyzing "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
The paper, "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling," addresses the complex challenge of adapting BERT-style masked pre-training, commonly employed in NLP, to convolutional networks (convnets). This research introduces Sparse masKed modeling (SparK), a method that harnesses sparse convolution to encode unmasked patches of images, overcoming key obstacles in transferring the BERT paradigm to visual data.
Key Contributions
The authors identify two significant hurdles in applying BERT-like pre-training to convnets:
- Irregular Input Handling: Convnets operate on dense, regular grids and, unlike transformers, cannot simply drop masked patches from a variable-length input sequence. SparK resolves this by treating the unmasked patches as a sparse set of voxels, akin to a 3D point cloud, and encoding them with sparse convolution (see the masking sketch after this list).
- Hierarchical Structure: Convnets have a natural multi-scale, hierarchical structure. SparK aligns with this by incorporating a hierarchical decoder that reconstructs the input from multi-scale encoded features, letting convnets fully exploit their inherent hierarchical advantages during pre-training.
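To make the first point concrete, here is a minimal sketch of the masking step. The function name and the 60% mask ratio are illustrative assumptions, not taken from the paper's code; the coordinates of the visible patches play the same role as sparse-voxel coordinates in 3D point cloud processing:

```python
import torch

def random_patch_mask(batch, h_patches, w_patches, mask_ratio=0.6):
    """Boolean mask over the patch grid; True marks a visible (unmasked) patch.

    The 0.6 mask ratio is an assumption for illustration; see the paper
    for the ratio actually used.
    """
    num_patches = h_patches * w_patches
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(batch, num_patches)           # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]    # keep the lowest-scoring patches
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, keep_idx, True)
    return mask.view(batch, h_patches, w_patches)

mask = random_patch_mask(batch=2, h_patches=7, w_patches=7)
# Coordinates of visible patches, analogous to the active-site coordinates
# that sparse 3D convolutions consume from a point cloud.
coords = mask[0].nonzero()  # shape (num_visible, 2): (row, col) per visible patch
```

A dense convolution cannot consume this ragged set of visible patches directly; that is exactly the gap sparse convolution closes.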
Methodology
SparK extends masked modeling by adopting sparse convolution, a technique typically employed in 3D point cloud processing. This adaptation allows masked image modeling to run directly within the convnet framework, without altering the backbone architecture, while keeping computation proportional to the unmasked content.
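One common way to emulate submanifold sparse convolution with standard dense ops, which is a reasonable mental model here even if the paper's actual implementation differs, is to run an ordinary convolution and then zero out the masked positions so the mask pattern survives every layer. The class below is a hypothetical sketch, not the authors' code:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Dense conv that re-applies the binary mask to its output.

    `mask` is 1.0 at visible positions and 0.0 at masked ones. Zeroing the
    output keeps masked positions empty, so features never leak into the
    masked region as depth grows -- the key property of sparse convolution.
    """
    def forward(self, x, mask):
        out = super().forward(x)
        # If the conv strides, downsample the mask to match (nearest keeps it binary).
        if out.shape[-2:] != mask.shape[-2:]:
            mask = nn.functional.interpolate(mask, size=out.shape[-2:], mode="nearest")
        return out * mask, mask

conv = MaskedConv2d(3, 64, kernel_size=3, stride=2, padding=1)
x = torch.randn(2, 3, 224, 224)
mask = (torch.rand(2, 1, 224, 224) > 0.6).float()  # toy pixel-level mask
y, mask = conv(x, mask)  # y is zero wherever the mask is zero
```

Normalization layers need the same treatment: their statistics should be computed only over visible positions, since otherwise the zeros at masked sites would skew the feature distribution, which is the shift the first bullet below refers to.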
- Sparse Convolution: By encoding only the unmasked pixels, SparK avoids the distribution shift that zero-filled masked regions would otherwise introduce, and the mask pattern is preserved exactly across convolution layers. The result is efficient computation, with significant savings in both memory and processing time.
- Hierarchical Decoder: The multi-scale decoding strategy complements the convnet architecture: features from every encoder stage contribute to reconstruction, so masked modeling exploits convnet strengths rather than discarding them (see the decoder sketch after this list).
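A rough sketch of that decoding idea, with module names and channel sizes of my own choosing: each scale's sparse features are first "densified" by filling masked positions with a learnable per-scale mask embedding, then fused coarse-to-fine, UNet-style:

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Fuse multi-scale encoder features coarse-to-fine and regress pixels."""
    def __init__(self, channels=(512, 256, 128)):  # coarse -> fine; sizes are illustrative
        super().__init__()
        # One learnable mask embedding per scale fills the empty (masked) positions.
        self.mask_embeds = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, c, 1, 1)) for c in channels]
        )
        # Each stage upsamples 2x into the next (finer) scale's channel width.
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(cin, cout, kernel_size=2, stride=2)
             for cin, cout in zip(channels[:-1], channels[1:])]
        )
        self.head = nn.Conv2d(channels[-1], 3, kernel_size=1)  # project to RGB

    def forward(self, feats, masks):
        # feats/masks are lists ordered coarse -> fine; masks are 1.0 where visible.
        x = None
        for i, (feat, mask) in enumerate(zip(feats, masks)):
            dense = feat * mask + self.mask_embeds[i] * (1 - mask)  # densify this scale
            x = dense if x is None else x + dense                   # skip-fuse with coarser map
            if i < len(self.ups):
                x = self.ups[i](x)                                  # move one scale finer
        return self.head(x)
```

Training would then regress the original pixels with a reconstruction loss computed only on the masked positions, as is standard in masked image modeling.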
Empirical Evaluation
The paper's empirical results demonstrate SparK's potential, showing substantive gains on standard benchmarks and surpassing both contrastive learning methods and transformer-based masked modeling.
- ImageNet Classification: Convnets pre-trained with SparK consistently outperform transformer-based self-supervised models such as SimMIM and iBOT at comparable (small and base) model sizes, illustrating the effectiveness of generative pre-training for convnets.
- COCO Object Detection and Segmentation: The improvement margins are more pronounced in tasks requiring spatial understanding, showcasing SparK's effective feature learning and transferability across tasks.
Implications and Future Directions
SparK exemplifies a successful adaptation of BERT-style masked modeling to convnets, reaffirming the suitability of convnets for a wide range of vision tasks when enhanced by generative pre-training. While vision transformers have garnered much of the recent attention, SparK represents a meaningful step toward reinvigorating convnet research with impactful pre-training strategies.
The paper hints at possible future directions, particularly in scaling SparK with larger networks and in different application domains. By overcoming conventional barriers associated with hierarchy and input irregularity in convnets, SparK may inspire similar approaches that further refine and expand on masked modeling within varied network architectures.
In summary, SparK's novel utilization of sparse convolution and its hierarchical approach mark a significant advancement in masked image modeling, offering a promising outlook for the pre-training of convolutional networks in image processing tasks.