- The paper introduces PS-KD, a self-distillation method that leverages past epoch predictions as soft targets instead of using separate teacher models.
- It refines hard one-hot labels by linearly combining them with prior predictions, effectively focusing on difficult examples through adaptive gradient scaling.
- Experiments on image classification (CIFAR-100, ImageNet), object detection, and machine translation show that PS-KD outperforms conventional label smoothing and other self-distillation techniques.
Self-Knowledge Distillation with Progressive Refinement of Targets
The paper introduces Progressive Self-Knowledge Distillation (PS-KD), a regularization technique aimed at improving the generalization of deep neural networks (DNNs). Unlike conventional knowledge distillation, which requires a separate (and typically larger) teacher model, PS-KD uses the model's own predictions from the previous epoch as soft targets, so the network progressively becomes its own teacher as training proceeds.
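Concretely, the refined target at epoch $t$ is a convex combination of the hard label and the previous epoch's prediction; in generic notation (the symbols here are ours, following the paper's description of a linearly growing mixing weight),

$$
\tilde{y}^{(t)} = (1 - \alpha_t)\, y + \alpha_t\, P^{(t-1)}(x),
\qquad \alpha_t = \alpha_T \cdot \frac{t}{T},
$$

where $y$ is the one-hot label, $P^{(t-1)}(x)$ is the model's prediction for $x$ at the previous epoch, and $\alpha_T$ is the final mixing weight reached after $T$ epochs.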
Method Overview
At the core of PS-KD is the progressive refinement of training targets: the hard one-hot targets are softened by linearly combining them with the model's own predictions from the previous epoch, with the weight on past predictions growing as training proceeds. The authors show that this self-guidance acts like hard example mining, since the resulting gradients are rescaled according to example difficulty, concentrating the learning signal on the more challenging samples. Because it only changes the targets, the method integrates easily with existing regularization techniques and improves performance across diverse supervised tasks.
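As a concrete illustration, the sketch below shows one way a PS-KD training epoch could be implemented in PyTorch, keeping a frozen snapshot of the previous epoch's model to produce the soft targets. Function and variable names (`ps_kd_targets`, `prev_model`, `alpha_t`) are ours, and the snapshot-based implementation is one reasonable choice rather than the authors' exact code.

```python
# Minimal PS-KD sketch (PyTorch). Assumes `model`, `loader`, `optimizer`,
# and `num_classes` are defined elsewhere; names are illustrative.
import copy
import torch
import torch.nn.functional as F

def ps_kd_targets(prev_model, x, y, num_classes, alpha_t):
    """Blend one-hot labels with the previous epoch's predictions."""
    hard = F.one_hot(y, num_classes).float()
    if prev_model is None or alpha_t == 0.0:    # first epoch: hard targets only
        return hard
    with torch.no_grad():
        past = F.softmax(prev_model(x), dim=1)  # P^(t-1)(x)
    return (1.0 - alpha_t) * hard + alpha_t * past

def train_one_epoch(model, prev_model, loader, optimizer, num_classes, alpha_t):
    model.train()
    for x, y in loader:
        soft = ps_kd_targets(prev_model, x, y, num_classes, alpha_t)
        logits = model(x)
        # soft-target cross-entropy: -sum_k soft_k * log p_k
        loss = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # freeze a copy of the just-trained weights to act as next epoch's teacher
    return copy.deepcopy(model).eval()
```

In practice one can either keep such a model snapshot (an extra forward pass per batch) or cache the per-sample predictions from the previous epoch (extra memory); the sketch above takes the snapshot route for simplicity.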
Experimental Analysis
The efficacy of PS-KD is evaluated across multiple domains: image classification on CIFAR-100 and ImageNet, object detection on PASCAL VOC, and machine translation on IWSLT15 and Multi30k. Results consistently show gains over conventional label smoothing and contemporary self-distillation techniques such as CS-KD and TF-KD. On CIFAR-100 in particular, PS-KD outperforms the baseline and competing methods in both accuracy and confidence calibration, illustrating its robustness and adaptability. Combining PS-KD with additional regularization such as CutMix amplifies these gains further, strengthening the case for self-derived soft targets in model training.
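Calibration in this setting is commonly summarized with the expected calibration error (ECE), which bins predictions by confidence and measures the gap between confidence and accuracy in each bin. A minimal NumPy sketch of the metric (the bin count and names are our own choices, not taken from the paper's code):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE: bin-weighted average gap between accuracy and mean confidence."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        accuracy = (predictions[in_bin] == labels[in_bin]).mean()
        confidence = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(accuracy - confidence)
    return ece
```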
Implications and Future Directions
A notable theoretical contribution of the work is the analysis showing that PS-KD inherently adapts the learning focus to sample difficulty through dynamic gradient scaling. This insight opens avenues for research into more sophisticated self-teaching frameworks within neural networks. PS-KD also has practical implications for reducing overfitting and improving confidence estimation without introducing additional parameters.
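The gradient-scaling view follows from standard softmax cross-entropy algebra. With logits $z$, softmax outputs $p$, and the refined target $\tilde{y}^{(t)}$ defined above (notation ours), the per-class gradient decomposes as

$$
\frac{\partial \mathcal{L}}{\partial z_k}
= p_k - \tilde{y}^{(t)}_k
= (1 - \alpha_t)\,(p_k - y_k) + \alpha_t\,\big(p_k - P^{(t-1)}_k(x)\big),
$$

so the hard-label gradient is damped by $(1 - \alpha_t)$ and supplemented by a term pulling the current prediction toward the past one. The paper's analysis shows that this rescaling effectively emphasizes examples the model still finds difficult, which is the hard-example-mining behavior described above.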
Going forward, variations of PS-KD that draw on more recent past predictions (e.g., from preceding iterations rather than the previous epoch), or that use more adaptive target-refinement schedules, could further improve robustness and scalability. By eliminating the need for a separate teacher model, PS-KD also reduces computational overhead, which is especially valuable in resource-constrained settings.
In conclusion, Progressive Self-Knowledge Distillation is a strong alternative to conventional knowledge distillation, offering both theoretical insight and empirical gains in the generalization of deep learning models.