TVT: Training-Free Vision Transformer Search on Tiny Datasets (2311.14337v1)

Published 24 Nov 2023 in cs.CV

Abstract: Training-free Vision Transformer (ViT) architecture search is presented to search for a better ViT with zero-cost proxies. While ViTs achieve significant distillation gains from CNN teacher models on small datasets, the current zero-cost proxies in ViTs do not generalize well to the distillation training paradigm according to our experimental observations. In this paper, for the first time, we investigate how to search in a training-free manner with the help of teacher models and devise an effective Training-free ViT (TVT) search framework. First, we observe that the similarity of attention maps between the ViT and the ConvNet teacher notably affects distillation accuracy. Thus, we present a teacher-aware metric conditioned on the feature attention relations between teacher and student. Additionally, TVT employs the L2-norm of the student's weights as a student-capability metric to improve ranking consistency. Finally, TVT searches for the best ViT for distillation with ConvNet teachers via our teacher-aware metric and student-capability metric, resulting in impressive gains in efficiency and effectiveness. Extensive experiments on various tiny datasets and search spaces show that our TVT outperforms state-of-the-art training-free search methods. The code will be released.
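The abstract describes two training-free signals that are combined to rank candidate ViTs: a teacher-aware metric based on the similarity of attention maps between the ViT student and a ConvNet teacher, and a student-capability metric based on the L2-norm of the student's weights. The sketch below illustrates how such a combined zero-cost score might be computed; it is a minimal illustration under stated assumptions, not the authors' released implementation. In particular, the `forward_features` hooks, the token-affinity construction of the "attention" maps, the token-count interpolation, and the balancing coefficient `alpha` are assumptions made for this example.

```python
# Illustrative sketch (not the authors' code) of a combined training-free
# score in the spirit of TVT, assuming a candidate ViT student and a ConvNet
# teacher that both expose intermediate feature maps.
import torch
import torch.nn.functional as F


def attention_similarity(student_feat: torch.Tensor,
                         teacher_feat: torch.Tensor) -> torch.Tensor:
    """Similarity of token-affinity ("attention") maps built from features.

    Both inputs are expected as (batch, tokens, channels); token counts are
    assumed to already match.
    """
    # Build token-token affinity maps from the features.
    s_attn = F.normalize(student_feat @ student_feat.transpose(1, 2), dim=-1)
    t_attn = F.normalize(teacher_feat @ teacher_feat.transpose(1, 2), dim=-1)
    # Higher cosine similarity between the two maps -> better fit to the teacher.
    return F.cosine_similarity(s_attn.flatten(1), t_attn.flatten(1), dim=1).mean()


def student_capability(student: torch.nn.Module) -> torch.Tensor:
    """L2-norm of the (randomly initialized) student's weights."""
    return torch.sqrt(sum((p ** 2).sum() for p in student.parameters()))


def tvt_score(student, teacher, images, alpha: float = 1.0) -> float:
    """Training-free score: teacher-aware term plus weighted capability term.

    `alpha` is a hypothetical balancing coefficient; the paper combines the two
    metrics, but the exact weighting and normalization may differ.
    """
    student.eval()
    teacher.eval()
    with torch.no_grad():
        s_feat = student.forward_features(images)   # assumed hook/API
        t_feat = teacher.forward_features(images)   # assumed hook/API
        if t_feat.dim() == 4:
            # ConvNet teacher: flatten (B, C, H, W) to tokens (B, H*W, C).
            t_feat = t_feat.flatten(2).transpose(1, 2)
        if t_feat.shape[1] != s_feat.shape[1]:
            # Assumption: match the teacher's token count to the student's via
            # 1-D interpolation over the token axis (a simplification).
            t_feat = F.interpolate(t_feat.transpose(1, 2),
                                   size=s_feat.shape[1],
                                   mode="linear",
                                   align_corners=False).transpose(1, 2)
        score = (attention_similarity(s_feat, t_feat)
                 + alpha * torch.log(student_capability(student)))
    return score.item()
```

Under this sketch, candidate architectures drawn from the search space would simply be ranked by `tvt_score` (higher is better), with no gradient updates to any student, which is what makes the search training-free.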
