
Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models (2311.18237v3)

Published 30 Nov 2023 in cs.CV and cs.LG

Abstract: Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data?", and propose a simple task-oriented knowledge transfer approach as a highly effective solution to this problem. Our experimental results on five target tasks show that the proposed approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, supervised ImageNet pretraining, and self-supervised DINO pretraining by up to 11.6%, 22.1%, 13.7%, and 29.8%, respectively. Furthermore, the proposed approach also demonstrates up to 9x, 4x and 15x reduction in pretraining compute cost when compared to task-agnostic VFM distillation, ImageNet pretraining and DINO pretraining, respectively, while outperforming them. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and introduce a retrieval-augmented knowledge transfer strategy that uses web-scale image retrieval to curate effective transfer sets.
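To make the abstract's "task-oriented knowledge transfer" concrete, here is a minimal illustrative sketch, not the authors' released code, of one common pipeline it suggests: adapt the frozen VFM to the target task with a lightweight head trained on the limited labels, distill the adapted teacher into a small student on a large (typically unlabeled) transfer set, then fine-tune the student on the labeled target data. The names vfm, student, labeled_loader, transfer_loader, and all hyperparameters below are hypothetical placeholders, and the VFM is assumed to return fixed-dimension image features.

# Illustrative sketch only; assumptions noted in the lead-in paragraph above.
import torch
import torch.nn as nn
import torch.nn.functional as F


def adapt_teacher(vfm, feat_dim, num_classes, labeled_loader, device, epochs=10):
    """Step 1: fit a linear task head on top of the frozen VFM using the limited labels."""
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    vfm.eval()
    for _ in range(epochs):
        for images, labels in labeled_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = vfm(images)  # frozen VFM features
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return head


def distill_to_student(vfm, head, student, transfer_loader, device,
                       epochs=30, temperature=2.0):
    """Step 2: train the small student to mimic the task-adapted teacher
    on the transfer set via soft-label distillation (no labels required)."""
    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
    vfm.eval(); head.eval(); student.train()
    for _ in range(epochs):
        for images in transfer_loader:
            images = images.to(device)
            with torch.no_grad():
                teacher_logits = head(vfm(images))
            student_logits = student(images)
            loss = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            opt.zero_grad(); loss.backward(); opt.step()
    return student


def finetune_student(student, labeled_loader, device, epochs=10):
    """Step 3: fine-tune the distilled student on the limited labeled target data."""
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    student.train()
    for _ in range(epochs):
        for images, labels in labeled_loader:
            images, labels = images.to(device), labels.to(device)
            loss = F.cross_entropy(student(images), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return student

The retrieval-augmented strategy mentioned at the end of the abstract would, under the same assumptions, curate transfer_loader itself: embed the target-task images, retrieve visually similar images from a web-scale pool (e.g., by nearest-neighbor search in an image-embedding space), and use that curated set as the distillation data.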

References (66)
Authors (7)
  1. Raviteja Vemulapalli (29 papers)
  2. Hadi Pouransari (32 papers)
  3. Fartash Faghri (32 papers)
  4. Sachin Mehta (48 papers)
  5. Mehrdad Farajtabar (56 papers)
  6. Mohammad Rastegari (57 papers)
  7. Oncel Tuzel (62 papers)
Citations (2)
