$V_kD$: Improving Knowledge Distillation using Orthogonal Projections (2403.06213v1)

Published 10 Mar 2024 in cs.CV and cs.AI

Abstract: Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available: https://github.com/roymiles/vkd
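To make the abstract's two components concrete, the sketch below shows one plausible way to combine an orthogonally constrained student-to-teacher projection with a normalisation step before feature matching, in PyTorch. This is an illustrative approximation, not the paper's released implementation: the class name OrthogonalFeatureDistiller, the use of torch.nn.utils.parametrizations.orthogonal, the layer-norm standardisation, and the MSE matching loss are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import orthogonal


class OrthogonalFeatureDistiller(nn.Module):
    """Illustrative sketch (not the paper's exact recipe): project student
    features to the teacher's dimension through a weight kept (semi-)orthogonal
    by PyTorch's built-in parametrization, normalise both feature sets, and
    match them with a simple regression loss."""

    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        # Linear projector whose weight is re-parameterised to remain orthogonal.
        self.proj = orthogonal(nn.Linear(d_student, d_teacher, bias=False))

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # f_student: (batch, d_student); f_teacher: (batch, d_teacher)
        z = self.proj(f_student)
        # Standardise both representations before matching; layer norm here is a
        # stand-in for the task-specific normalisation mentioned in the abstract.
        z = F.layer_norm(z, (z.size(-1),))
        t = F.layer_norm(f_teacher, (f_teacher.size(-1),))
        # Teacher features are treated as fixed targets.
        return F.mse_loss(z, t.detach())


# Example usage: this auxiliary loss would be added to the student's task loss,
# typically with a weighting hyperparameter chosen on a validation set.
distiller = OrthogonalFeatureDistiller(d_student=384, d_teacher=768)
loss_kd = distiller(torch.randn(8, 384), torch.randn(8, 768))
```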

Authors (3)
  1. Roy Miles (9 papers)
  2. Ismail Elezi (28 papers)
  3. Jiankang Deng (96 papers)
Citations (6)