Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning (2312.02194v1)
Abstract: In this paper, we present an approach to self-supervised learning for Vision Transformers (ViTs) that integrates local masked image modeling with progressive layer freezing. The method focuses on improving the efficiency and speed of initial-layer training in ViTs: by systematically freezing specific layers at strategic points during training, we reduce computational demands while maintaining or improving learning capability. Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in the initial layers and enhances semantic comprehension across scales. The results demonstrate a substantial reduction in training time (~12.5\%) with minimal impact on model accuracy (a 0.6\% decrease in top-1 accuracy). Our method achieves top-1 and top-5 accuracies of 82.6\% and 96.2\%, respectively, underscoring its potential in scenarios where computational resources and time are critical. This work marks an advancement in self-supervised learning for computer vision. The implementation of our approach is available at our project's GitHub repository: github.com/utkutpcgl/ViTFreeze.
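To make the freezing mechanism concrete, below is a minimal PyTorch sketch of progressive layer freezing for a ViT encoder during self-supervised pre-training. It is an illustrative approximation, not the released implementation: the `freeze_schedule` milestones, the `apply_progressive_freezing` helper, and the use of a `timm` backbone are assumptions made for this example, and the masked-image-modeling loss and multi-scale reconstruction heads are omitted. See the linked repository for the authors' actual code.

```python
# Illustrative sketch only (assumed names and schedule values, not the ViTFreeze code):
# progressively freeze the earliest transformer blocks of a ViT encoder at
# fixed fractions of the total training budget.
import torch.nn as nn
import timm  # assumes a standard timm ViT as the encoder backbone

encoder = timm.create_model("vit_base_patch16_224", pretrained=False)

# Hypothetical schedule: after this fraction of epochs, freeze this many leading blocks.
freeze_schedule = [(0.25, 2), (0.50, 4), (0.75, 6)]


def apply_progressive_freezing(model: nn.Module, epoch: int, total_epochs: int) -> None:
    """Freeze the first N transformer blocks once training passes each milestone."""
    n_freeze = 0
    for fraction, n_blocks in freeze_schedule:
        if epoch >= fraction * total_epochs:
            n_freeze = n_blocks  # the latest milestone reached wins
    for i, block in enumerate(model.blocks):
        trainable = i >= n_freeze  # early blocks stop receiving gradient updates
        for p in block.parameters():
            p.requires_grad = trainable


# Training loop skeleton (masked-image-modeling forward/backward omitted for brevity).
total_epochs = 100
for epoch in range(total_epochs):
    apply_progressive_freezing(encoder, epoch, total_epochs)
    # ... sample masked patches, reconstruct locally at multiple scales,
    # ... and update only the parameters that still require gradients.
```

In practice, frozen blocks can also be removed from the optimizer's parameter groups and run under `torch.no_grad()` to realize the full compute savings; the fixed fractions above are placeholders for the strategically chosen freezing points described in the paper.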
- Utku Mert Topcuoglu
- Erdem Akagündüz