Rethinking Self-Distillation: Label Averaging and Enhanced Soft Label Refinement with Partial Labels (2402.10482v2)
Abstract: We investigate the mechanisms of self-distillation in multi-class classification, particularly in the context of linear probing with fixed feature extractors where traditional feature learning explanations do not apply. Our theoretical analysis reveals that multi-round self-distillation effectively performs label averaging among instances with high feature correlations, governed by the eigenvectors of the Gram matrix derived from input features. This process leads to clustered predictions and improved generalization, mitigating the impact of label noise by reducing the model's reliance on potentially corrupted labels. We establish conditions under which multi-round self-distillation achieves 100% population accuracy despite label noise. Furthermore, we introduce a novel, efficient single-round self-distillation method using refined partial labels from the teacher's top two softmax outputs, referred to as the PLL student model. This approach replicates the benefits of multi-round distillation in a single round, achieving comparable or superior performance (especially in high-noise scenarios) while significantly reducing computational cost.
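To make the label-averaging view concrete, below is a minimal numpy sketch (not the paper's exact formulation) of self-distillation with a linear probe on fixed features: each round refits a ridge regressor to the previous round's soft outputs, which applies the same feature-correlation smoothing matrix to the labels repeatedly, and a partial-label target is then formed from the teacher's top-2 classes. The toy data, the ridge regularizer `lam`, and the equal 0.5 mass on the two candidate classes are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: fixed features X (n x d) and one-hot, possibly noisy, labels Y (n x K).
n, d, K = 60, 20, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)
Y = np.eye(K)[y]

lam = 1.0  # ridge regularization strength (illustrative value)

def fit_linear_probe(X, Y, lam):
    """Closed-form ridge regression of soft targets Y on fixed features X."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def distill_round(X, Y, lam):
    """One self-distillation round: fit on current targets, emit new soft targets."""
    W = fit_linear_probe(X, Y, lam)
    return X @ W  # teacher predictions become the next round's labels

# Multi-round self-distillation: every round multiplies the targets by the same
# smoothing matrix A = X (X^T X + lam I)^{-1} X^T, which shares eigenvectors with
# the Gram matrix X X^T.  Repeating the round therefore averages labels along
# directions of high feature correlation and shrinks the remaining components,
# which is the label-averaging effect described in the abstract.
Y_t = Y.copy()
for _ in range(5):
    Y_t = distill_round(X, Y_t, lam)

# Partial-label construction sketched from the abstract's description: keep the
# teacher's top-2 classes per sample as a candidate set with equal mass.
probs = np.exp(Y_t) / np.exp(Y_t).sum(axis=1, keepdims=True)  # softmax of teacher scores
top2 = np.argsort(-probs, axis=1)[:, :2]
Y_partial = np.zeros_like(Y)
np.put_along_axis(Y_partial, top2, 0.5, axis=1)

# A single student trained on the refined partial labels stands in for the
# multi-round pipeline in this sketch.
W_student = fit_linear_probe(X, Y_partial, lam)
print("student train accuracy:", (np.argmax(X @ W_student, 1) == y).mean())
```

In this linear-probing setting the per-round map is closed-form, so the multi-round behavior reduces to powers of a single matrix; that is what makes the eigenvector (Gram matrix) analysis, and its replacement by one refined-partial-label round, tractable.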