CL-MAE: Curriculum-Learned Masked Autoencoders (2308.16572v3)

Published 31 Aug 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Masked image modeling has been demonstrated as a powerful pretext task for learning robust representations that generalize effectively across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module that can generate masks of different complexities, and we integrate it into masked autoencoders (MAE). Our module is trained jointly with the MAE while adjusting its behavior during training: it transitions from a partner of the MAE (optimizing the same reconstruction loss) to an adversary (optimizing the opposite loss), passing through a neutral state. The transition between these behaviors is smooth, regulated by a factor that multiplies the reconstruction loss of the masking module. The resulting training procedure yields an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. Empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can successfully be used to self-supervise masked autoencoders. We release our code at https://github.com/ristea/cl-mae.
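
The curriculum mechanism described in the abstract lends itself to a short sketch: the masking module optimizes the MAE reconstruction loss scaled by a factor that moves smoothly from +1 (partner) through 0 (neutral) to -1 (adversary). The minimal Python sketch below is illustrative only; the linear schedule and the names (curriculum_factor, masking_module_loss, lam) are assumptions, not the authors' code, whose exact schedule lives in the released repository.

```python
def curriculum_factor(epoch: int, total_epochs: int) -> float:
    """Return the factor multiplied with the masking module's loss.

    +1 -> partner of the MAE (module also minimizes the reconstruction
          loss, producing easy masks), 0 -> neutral, -1 -> adversary
          (module maximizes the reconstruction loss, producing hard masks).
    The linear ramp is an assumption for illustration.
    """
    progress = epoch / max(total_epochs - 1, 1)
    return 1.0 - 2.0 * progress


def masking_module_loss(reconstruction_loss: float, lam: float) -> float:
    """Masking-module objective: lam * L_rec.

    The MAE itself always minimizes L_rec; only the module's loss is scaled.
    """
    return lam * reconstruction_loss


if __name__ == "__main__":
    total_epochs = 10
    for epoch in range(total_epochs):
        lam = curriculum_factor(epoch, total_epochs)
        role = "partner" if lam > 0 else "adversary" if lam < 0 else "neutral"
        print(f"epoch {epoch}: lambda = {lam:+.2f} ({role})")
```

In the full training loop, the MAE's parameters are updated to minimize the reconstruction loss as usual, while the masking module's parameters are updated against lam times that loss, so the module's role flips as the factor changes sign, producing the easy-to-hard curriculum.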

Authors (5)
  1. Neelu Madan (4 papers)
  2. Nicolae-Catalin Ristea (27 papers)
  3. Kamal Nasrollahi (16 papers)
  4. Thomas B. Moeslund (51 papers)
  5. Radu Tudor Ionescu (103 papers)
Citations (6)