$V_kD$: Improving Knowledge Distillation using Orthogonal Projections (2403.06213v1)
Abstract: Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degrade when transferred to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. The method is derived from a small set of core principles, from which two components emerge: an orthogonal projection and a task-specific normalisation. Equipped with both components, our transformer models outperform all previous methods on ImageNet and achieve up to a 4.4% relative improvement over the previous state of the art. To further demonstrate the generality of our method, we apply it to object detection and image generation, where we obtain consistent and substantial improvements over the state of the art. Code and models are publicly available: https://github.com/roymiles/vkd
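As a concrete illustration of the two components, the sketch below pairs an orthogonally constrained projector with a simple feature-distillation loss in PyTorch. The matrix-exponential parameterisation of the orthogonal map, the per-dimension standardisation of the teacher features, and the smooth-L1 objective are assumptions made for this example, not necessarily the exact choices in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrthogonalProjector(nn.Module):
    """Maps student features into the teacher's feature space through an
    orthogonally constrained linear projection (illustrative sketch)."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.student_dim = student_dim
        self.teacher_dim = teacher_dim
        n = max(student_dim, teacher_dim)
        # Unconstrained parameter; only its skew-symmetric part is used.
        self.weight = nn.Parameter(0.01 * torch.randn(n, n))

    def forward(self, z_s: torch.Tensor) -> torch.Tensor:
        # exp(A - A^T) is orthogonal because A - A^T is skew-symmetric.
        q = torch.linalg.matrix_exp(self.weight - self.weight.T)
        # Slicing the leading rows/columns keeps the map semi-orthogonal
        # while matching the student and teacher dimensions.
        proj = q[: self.student_dim, : self.teacher_dim]
        return z_s @ proj


def distillation_loss(z_s: torch.Tensor, z_t: torch.Tensor,
                      projector: OrthogonalProjector) -> torch.Tensor:
    """Feature distillation between projected student features and normalised
    teacher features. The per-dimension standardisation below is an
    illustrative stand-in for the paper's task-specific normalisation."""
    z_s_proj = projector(z_s)                        # (B, teacher_dim)
    z_t = (z_t - z_t.mean(0)) / (z_t.std(0) + 1e-6)  # standardise teacher
    return F.smooth_l1_loss(z_s_proj, z_t)


# Example usage with random features standing in for backbone outputs.
if __name__ == "__main__":
    projector = OrthogonalProjector(student_dim=384, teacher_dim=768)
    z_s, z_t = torch.randn(32, 384), torch.randn(32, 768)
    loss = distillation_loss(z_s, z_t, projector)
    loss.backward()
```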
Authors: Roy Miles, Ismail Elezi, Jiankang Deng