An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training (2404.12210v2)
Abstract: Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether the fine-tuning performance of \textit{extremely simple} lightweight ViTs can also benefit from this pre-training paradigm, which remains considerably less studied than the well-established methodology of lightweight architecture design. Our study follows an observation-analysis-solution flow. We first systematically observe that the evaluated pre-training methods behave differently with respect to the scale of the downstream fine-tuning data. We then analyze layer representation similarities and attention maps across the obtained models, which clearly reveal that MIM pre-training learns the higher layers poorly, leading to unsatisfactory transfer performance on data-insufficient downstream tasks. This finding naturally guides the design of our distillation strategies during pre-training, which address the above deterioration. Extensive experiments demonstrate the effectiveness of our approach: pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical designs ($5.7M$/$6.5M$ parameters) achieves $79.4\%$/$78.9\%$ top-1 accuracy on ImageNet-1K. It also yields state-of-the-art (SOTA) performance on ADE20K segmentation ($42.8\%$ mIoU) and LaSOT tracking ($66.1\%$ AUC) in the lightweight regime; the latter even surpasses all current SOTA lightweight CPU-realtime trackers.
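The layer-wise representation analysis mentioned above is typically carried out with centered kernel alignment (CKA) between per-layer features. Below is a minimal sketch of linear CKA, assuming NumPy feature matrices of shape (samples, dim); the function name, feature shapes, and random placeholder activations are illustrative assumptions, not taken from the paper's released code.

```python
# A minimal sketch (illustrative, not the paper's implementation) of linear CKA,
# the kind of layer-wise representation-similarity measure referred to above.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim)."""
    x = x - x.mean(axis=0, keepdims=True)   # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

if __name__ == "__main__":
    # Hypothetical example: compare layer features of a MIM-pre-trained ViT
    # against features from a reference model on the same batch of images.
    feats_mim = np.random.randn(256, 384)   # placeholder activations
    feats_ref = np.random.randn(256, 384)
    print(f"CKA similarity: {linear_cka(feats_mim, feats_ref):.3f}")
```

Comparing such similarity scores layer by layer is one way to expose the kind of higher-layer degradation under MIM pre-training that the abstract describes.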