Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization (2402.18128v1)
Abstract: Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning. It operates by randomly masking image patches and reconstructing these masked patches using the unmasked ones. A key limitation of MAE lies in its disregard for the varying informativeness of different patches, as it uniformly selects patches to mask. To overcome this, some approaches propose masking based on patch informativeness. However, these methods often do not consider the specific requirements of downstream tasks, potentially leading to suboptimal representations for these tasks. In response, we introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining. Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at: https://github.com/Alexiland/MLOMAE
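For intuition, the sketch below contrasts MAE's uniform random masking with the kind of learned, per-patch masking strategy that downstream feedback could drive. This is a minimal illustration, not the authors' implementation: the `LearnedMasking` head, its linear scorer, and the three-stage outline in the comments are assumptions based only on the abstract.

```python
import torch
import torch.nn as nn

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
    """MAE-style uniform masking: keep a random subset of patch embeddings."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)           # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]             # random subset to keep visible
    return torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

class LearnedMasking(nn.Module):
    """Hypothetical learned masking head: scores each patch and keeps the
    highest-scoring ones. A hard top-k is non-differentiable, so in practice a
    continuous relaxation (e.g., Gumbel-softmax or SoftSort) would be needed to
    pass gradients from a downstream loss back into the scorer."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, patches: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
        B, N, D = patches.shape
        num_keep = int(N * (1 - mask_ratio))
        logits = self.score(patches).squeeze(-1)              # (B, N) per-patch keep-scores
        keep_idx = logits.argsort(dim=1, descending=True)[:, :num_keep]
        return torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

# Conceptual outline of the multi-level loop described in the abstract
# (schematic only; stage details and optimizer choices are assumptions):
#   Stage 1: pretrain the MAE encoder/decoder under the current masking strategy.
#   Stage 2: fit a downstream head (e.g., a linear classifier) on the encoder.
#   Stage 3: update the masking head to minimize the downstream validation loss,
#            differentiating through Stages 1-2 (e.g., with a multilevel
#            optimization library such as Betty).

if __name__ == "__main__":
    x = torch.randn(2, 196, 768)       # e.g., ViT-B/16 embeddings of a 224x224 image
    print(random_masking(x).shape)     # torch.Size([2, 49, 768]) at mask_ratio=0.75
    print(LearnedMasking(768)(x).shape)
```

With a 75% mask ratio only 49 of 196 patches stay visible, which is why which patches survive matters: a learned scorer can keep the patches most useful to the downstream task rather than a uniformly random subset.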
Authors: Han Guo, Ramtin Hosseini, Ruiyi Zhang, Sai Ashish Somayajula, Ranak Roy Chowdhury, Rajesh K. Gupta, Pengtao Xie