Dynamic Data-Free Knowledge Distillation by Easy-to-Hard Learning Strategy (2208.13648v3)
Abstract: Data-free knowledge distillation (DFKD) is a widely used variant of knowledge distillation (KD) for settings where the original training data is unavailable. It trains a lightweight student model with the aid of a large pretrained teacher model without any access to the training data. However, existing DFKD methods suffer from an inadequate and unstable training process, because they do not adjust the generation target dynamically based on the status of the student model during learning. To address this limitation, we propose a novel DFKD method called CuDFKD. It teaches the student with a dynamic strategy that gradually generates easy-to-hard pseudo samples, mirroring how humans learn, and it adapts the generation target according to the current status of the student model. Moreover, we provide a theoretical analysis based on the majorization minimization (MM) algorithm and explain the convergence of CuDFKD. To measure the robustness and fidelity of DFKD methods, we propose two additional metrics. Experiments show that CuDFKD achieves performance comparable to state-of-the-art (SOTA) DFKD methods on all datasets, while also converging fastest and exhibiting the best robustness among the compared SOTA DFKD methods.
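To make the easy-to-hard idea concrete, the sketch below illustrates one possible data-free distillation loop in which a generator synthesizes pseudo samples whose difficulty for the student is gradually increased over training. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names, the linear pacing schedule, and the adversarial difficulty weighting are all hypothetical, and the paper instead derives its dynamic generation target from a majorization minimization (MM) formulation.

```python
# Minimal sketch of an easy-to-hard, data-free KD loop (illustrative only).
# `generator`, `student`, `teacher`, the pacing schedule, and loss weights
# are assumptions for exposition, not the CuDFKD formulation from the paper.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """Standard temperature-scaled KL distillation loss."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

def train_easy_to_hard_sketch(generator, student, teacher, epochs, steps,
                              zdim=256, batch=64, device="cuda"):
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
    opt_s = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    teacher.eval()
    for epoch in range(epochs):
        # Pacing function (assumed linear): how strongly the generator is
        # rewarded for producing samples the student disagrees with.
        hardness = min(1.0, (epoch + 1) / epochs)
        for _ in range(steps):
            z = torch.randn(batch, zdim, device=device)

            # 1) Generator step: later in training, favor harder samples,
            #    i.e. larger teacher-student disagreement.
            x = generator(z)
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            disagreement = kd_loss(s_logits, t_logits)
            loss_g = -hardness * disagreement
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()

            # 2) Student step: distill on the freshly generated batch.
            with torch.no_grad():
                x = generator(z)
                t_logits = teacher(x)
            s_logits = student(x)
            loss_s = kd_loss(s_logits, t_logits)
            opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```

The key design point reflected here is that the generation objective is not fixed: the weight on teacher-student disagreement grows as the student matures, so early batches are easy (the student can already match the teacher) and later batches probe regions where it still fails.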