Bootstrap Masked Visual Modeling via Hard Patches Mining (2312.13714v1)

Published 21 Dec 2023 in cs.CV

Abstract: Masked visual modeling has attracted much attention due to its promising potential in learning generalizable representations. Typical approaches urge models to predict specific contents of masked tokens, which can be intuitively considered as teaching a student (the model) to solve given problems (predicting masked contents). Under such settings, the performance is highly correlated with mask strategies (the difficulty of provided problems). We argue that it is equally important for the model to stand in the shoes of a teacher to produce challenging problems by itself. Intuitively, patches with high values of reconstruction loss can be regarded as hard samples, and masking those hard patches naturally becomes a demanding reconstruction task. To empower the model as a teacher, we propose Hard Patches Mining (HPM), predicting patch-wise losses and subsequently determining where to mask. Technically, we introduce an auxiliary loss predictor, which is trained with a relative objective to prevent overfitting to exact loss values. Also, to gradually guide the training procedure, we propose an easy-to-hard mask strategy. Empirically, HPM brings significant improvements under both image and video benchmarks. Interestingly, solely incorporating the extra loss prediction objective leads to better representations, verifying the efficacy of determining where is hard to reconstruct. The code is available at https://github.com/Haochen-Wang409/HPM.


Summary

  • The paper introduces Hard Patches Mining (HPM), a strategy that guides masked visual modeling by predicting patch-wise reconstruction losses and masking the patches that are hardest to reconstruct.
  • With ViT-B and ViT-L backbones, HPM reaches 84.2% and 85.8% top-1 accuracy on ImageNet-1K, respectively.
  • The approach improves training efficiency and generalizes across both image and video self-supervised benchmarks.

Overview of "Bootstrap Masked Visual Modeling via Hard Patches Mining"

The paper "Bootstrap Masked Visual Modeling via Hard Patches Mining" addresses the advancement of Masked Visual Modeling (MVM) through the innovative concept of Hard Patches Mining (HPM). MVM, inspired by Masked LLMing (MLM) from NLP, aims to uncover masked contents in visual data to develop robust visual representations without the need for labeled data. The authors introduce HPM to enable models to autonomously identify and tackle challenging patches within images and videos, thus simulating a learning process that functions both as learner (student) and problem setter (teacher).

Key Contributions and Findings

The primary contribution of this paper is the integration of Hard Patches Mining into the MVM process. The authors argue that traditional MVM methods only solve predefined tasks, much like a student working through a fixed problem set. HPM instead encourages the model to create challenging tasks for itself by predicting which patches will incur the highest reconstruction loss, and masking those hard patches. This self-directed task generation is posited to enhance learning by fostering a deeper understanding of visual content.

The authors empirically demonstrate HPM's significant improvements across image and video benchmarks. Key results include:

  • On ImageNet-1K, HPM achieves 84.2% and 85.8% top-1 accuracy with ViT-B and ViT-L backbones, respectively, surpassing MAE baselines pre-trained for twice as many epochs.
  • On video benchmarks such as Something-Something V2 and Kinetics-400, models trained with HPM outperform baseline masked video modeling methods.
  • Adding the loss prediction objective alone, without changing the mask strategy, already yields better representations, confirming the value of learning where reconstruction is hard.

Technical Details

The methodology relies on predicting patch-level reconstruction losses to guide the masking process, effectively identifying the patches that are hardest to reconstruct. This is accomplished by attaching an auxiliary loss predictor to the model. The predictor is trained with a relative objective that supervises only the ordering of patch losses rather than their absolute values, which keeps it from overfitting to exact loss magnitudes that shrink as the reconstruction model improves; a sketch of such an objective follows.
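As a concrete illustration, here is a minimal PyTorch sketch of a pairwise ranking form of such a relative objective. The function name and tensor shapes are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def relative_loss_objective(pred_loss, true_loss):
    """Pairwise ranking loss for the auxiliary loss predictor.

    pred_loss: (B, N) predicted per-patch reconstruction losses (logits).
    true_loss: (B, N) actual per-patch reconstruction losses (detached).

    Only the relative ordering of patches is supervised, so the predictor
    need not track absolute loss values, which keep shrinking as the
    reconstruction model improves.
    """
    # Pairwise differences: diff[b, i, j] = pred[b, i] - pred[b, j]
    pred_diff = pred_loss.unsqueeze(2) - pred_loss.unsqueeze(1)        # (B, N, N)
    # Binary targets: 1 where patch i is truly harder than patch j
    target = (true_loss.unsqueeze(2) > true_loss.unsqueeze(1)).float()
    # Exclude the diagonal (a patch compared with itself)
    off_diag = 1.0 - torch.eye(pred_loss.size(1), device=pred_loss.device)
    bce = F.binary_cross_entropy_with_logits(pred_diff, target, reduction="none")
    return (bce * off_diag).sum() / (off_diag.sum() * pred_loss.size(0))
```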

Additionally, the authors introduce an 'easy-to-hard' masking strategy: training begins with mostly random masking and gradually shifts toward understanding-driven masking based on the predicted patch difficulties. This is achieved by progressively increasing the fraction of masked patches selected by the loss predictor rather than at random, as in the sketch below.
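A minimal sketch of such a schedule, assuming a linear per-epoch ramp; the function name and the alpha_0/alpha_T endpoints are illustrative defaults, not values taken from the paper.

```python
import torch

def easy_to_hard_mask(pred_loss, mask_ratio, epoch, total_epochs,
                      alpha_0=0.0, alpha_T=0.5):
    """Choose patches to mask, shifting from random toward hard patches.

    pred_loss: (B, N) predicted per-patch reconstruction losses.
    mask_ratio: fraction of the N patches to mask (e.g. 0.75).
    alpha_0 / alpha_T: start / end fraction of the masking budget spent
        on the hardest patches; the rest is sampled uniformly at random.
    """
    B, N = pred_loss.shape
    num_mask = int(mask_ratio * N)
    # Linear easy-to-hard schedule over training
    alpha = alpha_0 + (alpha_T - alpha_0) * epoch / max(total_epochs - 1, 1)
    num_hard = int(alpha * num_mask)

    # Hardest patches: highest predicted reconstruction loss
    mask = torch.zeros(B, N, dtype=torch.bool, device=pred_loss.device)
    hard_idx = pred_loss.topk(num_hard, dim=1).indices
    mask.scatter_(1, hard_idx, True)

    # Fill the remaining budget with uniformly random patches
    noise = torch.rand(B, N, device=pred_loss.device)
    noise.masked_fill_(mask, -1.0)  # never re-select already-masked patches
    rand_idx = noise.topk(num_mask - num_hard, dim=1).indices
    mask.scatter_(1, rand_idx, True)
    return mask  # True = patch is masked (hidden from the encoder)
```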

Implications and Future Prospects

Hard Patches Mining offers a path toward more versatile self-supervised learning models. By allowing models to autonomously pose and solve hard visual tasks, this paper sets the stage for more adaptive systems capable of learning from fewer examples and applying learned knowledge to new and varying tasks.

Practically, HPM may improve performance in domains where generating labeled data is challenging, offering enhancements to fields like medical imaging, autonomous driving, and surveillance systems, where unsupervised or semi-supervised learning is crucial.

Theoretically, this work prompts further investigations into self-supervised learning paradigms, particularly in integrating dual roles of learner and task-generator within AI models. Future research might expand on integrating HPM into broader AI systems, exploring its potential synergies with novel architectural models or hybrid learning frameworks.

Overall, this paper underscores the benefits of enhancing model autonomy in visual representation learning through strategic task difficulty modulation, which could inspire more refined approaches in unsupervised and self-supervised model development.
