
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection (2204.02964v2)

Published 6 Apr 2022 in cs.CV

Abstract: We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% $\sim$ 50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection from single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained large kernel patchify stem, and its intermediate features can naturally serve as the higher resolution inputs of a feature pyramid network without further upsampling or other manipulations. While the pre-trained ViT is only regarded as the 3$^{rd}$-stage of our detector's backbone instead of the whole feature extractor. This results in a ConvNet-ViT hybrid feature extractor. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results compared with the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8$\times$ faster. Code and pre-trained models are available at https://github.com/hustvl/MIMDet.

Overview of "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection"

This paper presents MIMDet, an approach for adapting Masked Image Modeling (MIM) pre-trained vanilla Vision Transformers (ViTs) to object detection. The authors leverage two key observations: first, a MIM pre-trained ViT encoder performs surprisingly well even when given only 25% to 50% of the input patch embeddings as randomly sampled partial observations; second, multi-scale representations can be constructed by simply replacing the original large-kernel patchify stem with a randomly initialized compact convolutional stem whose intermediate features feed a feature pyramid directly.
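To make the first observation concrete, here is a minimal sketch of how randomly sampled partial observations could be drawn from the patch embeddings. It assumes PyTorch tensors of shape (B, N, C); the function name and default keep ratio are illustrative rather than taken from the MIMDet repository.

```python
import torch

def sample_partial_tokens(patch_embeddings: torch.Tensor, keep_ratio: float = 0.5):
    """Randomly keep a fraction of the patch embeddings (a sketch of observation (i)).

    patch_embeddings: (B, N, C) tokens produced by the stem; keep_ratio is typically
    in the 0.25-0.5 range reported in the paper. Returns the sampled tokens and the
    indices of the kept positions so the full layout can be restored later.
    """
    B, N, C = patch_embeddings.shape
    num_keep = int(N * keep_ratio)
    # Draw an independent random permutation per image and keep the first num_keep slots.
    noise = torch.rand(B, N, device=patch_embeddings.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]
    sampled = torch.gather(patch_embeddings, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
    return sampled, ids_keep
```

Keeping only a quarter to half of the tokens shortens the sequence the ViT encoder must process during fine-tuning, which is where the reduced computational cost comes from.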

Key Contributions

  • Partial Input Utilization: The paper shows that a MIM pre-trained ViT encoder can handle object-level recognition even when exposed to only a fraction of the input embeddings, between 25% and 50%, reducing computation while maintaining competitive accuracy.
  • Hybrid Architecture Design: Replacing the large-kernel patchify stem with a randomly initialized compact convolutional stem and treating the pre-trained ViT as only the third stage of the backbone yields a ConvNet-ViT hybrid feature extractor. The stem's intermediate features naturally supply the higher-resolution inputs of a feature pyramid network, so multi-scale representations come essentially for free (see the sketch after this list).
  • Efficient Training and Convergence: MIMDet converges 2.8 times faster than the previous best adapted vanilla ViT detector while using a more modest fine-tuning recipe, striking a balance between model complexity and training efficiency.
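
The ConvNet-ViT hybrid described above can be sketched roughly as follows; the channel widths, strides, and module layout are assumptions for illustration, and the MIM pre-trained ViT blocks are replaced by a placeholder.

```python
import torch
import torch.nn as nn

class HybridBackboneSketch(nn.Module):
    """Illustrative ConvNet-ViT hybrid: a compact conv stem supplies the higher-resolution
    pyramid levels, and a placeholder stands in for the pre-trained ViT as the coarser stage."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Randomly initialized compact conv stem replacing the 16x16 patchify stem.
        self.stem = nn.Sequential(
            nn.Conv2d(3, embed_dim // 4, kernel_size=3, stride=2, padding=1),   # stride 2
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim // 2, 3, stride=2, padding=1),  # stride 4
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1),       # stride 8
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, 3, stride=2, padding=1),            # stride 16
        )
        # Placeholder for the MIM pre-trained ViT blocks operating on stride-16 tokens.
        self.vit_stage = nn.Identity()

    def forward(self, images: torch.Tensor):
        feats = []
        x = images
        for layer in self.stem:
            x = layer(x)
            if isinstance(layer, nn.Conv2d):
                feats.append(x)  # conv features at strides 2, 4, 8, 16
        # Flatten the stride-16 map into tokens for the ViT stage, then restore the grid.
        # (Random token sampling, as in the earlier sketch, is omitted here for brevity.)
        B, C, H, W = feats[-1].shape
        tokens = feats[-1].flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.vit_stage(tokens)                 # pre-trained ViT acts as the 3rd stage
        c4 = tokens.transpose(1, 2).reshape(B, C, H, W)
        # Stride-4, stride-8, and stride-16 features can feed an FPN without extra upsampling.
        return feats[1], feats[2], c4
```

The point this illustrates is that the stem, not the ViT, provides the finer pyramid levels, so no transposed convolutions or other manipulations of the single-scale ViT features are needed.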

Numerical Results

MIMDet outperforms the hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO. It also surpasses the previous best adapted vanilla ViT detector under a more modest fine-tuning setup, with gains in both box and mask accuracy, underlining its computational and practical advantages.

Implications and Future Directions

  • Scalability and Adaptation: The research suggests that developing specific architectures for object detection might become less crucial as more general representations are leveraged effectively. This might lead to broader implications for model design across visual understanding tasks.
  • Potential for Other Domains: The work paves the way for similar methods in domains that can benefit from transformer models, including video analysis, multi-task learning, and multi-modal integration.
  • Exploration of MIM in ViTs: The insights gained from reusing the MAE (Masked Autoencoder) pre-trained decoder point to further adaptations and optimizations of MIM pre-training worth exploring (a completion sketch follows this list).
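
The decoder-based completion mentioned in the last bullet can be sketched as follows, assuming the sampled tokens and kept indices from the earlier sampling sketch; mask_token, the decoder interface, and the function name are illustrative placeholders rather than the repository's API.

```python
import torch
import torch.nn as nn

def complete_tokens(sampled: torch.Tensor, ids_keep: torch.Tensor, num_total: int,
                    mask_token: torch.Tensor, decoder: nn.Module) -> torch.Tensor:
    """Fill the unsampled positions with a learned mask token and run a lightweight
    decoder (e.g., an MAE pre-trained one) to recover a full-length token sequence.

    sampled: (B, n_keep, C) encoder outputs; ids_keep: (B, n_keep) kept positions;
    mask_token: (1, 1, C) learned embedding standing in for the missing positions.
    """
    B, n_keep, C = sampled.shape
    full = mask_token.expand(B, num_total, C).clone()
    full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, C), sampled)
    # The decoder is assumed to map (B, num_total, C) -> (B, num_total, C); its output
    # can then be reshaped into a dense feature map for the detection neck.
    return decoder(full)
```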

Conclusion

This paper contributes to the ongoing effort to adapt Vision Transformers to practical tasks such as object detection. By showing how MIM pre-trained ViTs can be exploited through simple architectural adjustments, the authors demonstrate improved performance and efficiency, suggesting a promising trajectory for future research in computer vision. The findings encourage leveraging general pre-trained representations rather than task-specific backbones, in line with current trends toward simplicity and effectiveness.

Authors (6)
  1. Yuxin Fang (14 papers)
  2. Shusheng Yang (16 papers)
  3. Shijie Wang (62 papers)
  4. Yixiao Ge (99 papers)
  5. Ying Shan (252 papers)
  6. Xinggang Wang (163 papers)
Citations (52)