Self Distillation in Convolutional Neural Networks
In Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation, Zhang et al. introduce a framework termed self distillation, which improves the accuracy of convolutional neural networks without the computational burden of deeper or wider architectures. By transferring knowledge internally, from the deeper parts of a model to its shallower parts, the authors obtain higher accuracy from the same network.
Summary of Contributions
The key contributions of the paper include:
- Self Distillation Framework: The proposed method partitions a convolutional neural network into several sections and transfers knowledge from the deeper sections (acting as internal teachers) to the shallower ones (students). This differs from traditional knowledge distillation, which requires a separate, pre-trained teacher model (a minimal architectural sketch follows this list).
- Performance Improvement: An average accuracy boost of 2.65% is reported across various architectures, including a maximum of 4.07% improvement in VGG19.
- Adaptability for Edge Devices: The self distillation approach allows for depth-wise scalable inference, offering flexibility for accuracy-efficiency trade-offs on resource-constrained hardware.
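To make the framework concrete, the sketch below wraps a stack of backbone sections with one auxiliary classifier per section, so every section produces its own logits and feature map. This is a minimal PyTorch illustration under assumed names (SelfDistillationNet, sections, heads), not the authors' implementation; the bottleneck blocks used in the paper's auxiliary heads are omitted for brevity.

```python
import torch.nn as nn

class SelfDistillationNet(nn.Module):
    """Multi-exit wrapper: one auxiliary classifier per backbone section.

    `sections` is a list of nn.Sequential blocks that together form the
    original CNN; `channels` lists the output channels of each section.
    Names and structure are illustrative, not the authors' code.
    """
    def __init__(self, sections, channels, num_classes):
        super().__init__()
        self.sections = nn.ModuleList(sections)
        # One head (global pooling + linear) per section; the last head
        # plays the role of the network's original classifier.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(c, num_classes),
            )
            for c in channels
        ])

    def forward(self, x):
        logits, features = [], []
        for section, head in zip(self.sections, self.heads):
            x = section(x)
            features.append(x)
            logits.append(head(x))
        return logits, features  # one (logits, feature map) pair per exit
```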
Experimental Evaluation
The experiments use five convolutional network architectures (ResNet, WideResNet, Pyramid ResNet, ResNeXt, VGG) on two datasets (CIFAR100 and ImageNet):
- On CIFAR100, the self distillation framework raised the average accuracy by 2.65%, with the improvements ranging from a minimum of 0.61% in ResNeXt to a maximum of 4.07% in VGG19.
- On ImageNet, the approach achieved an average improvement of 2.02%.
- Because no separate teacher has to be pre-trained, training ResNet50 on CIFAR100 took 5.87 hours instead of the 26.98 hours required by conventional distillation, while accuracy rose from 79.33% to 81.04%.
Methodological Insights
Training Mechanism: Self distillation training involves:
- Dividing the network into several sections according to depth.
- Attaching a classifier, built from a bottleneck layer and a fully connected layer, after each section.
- Applying three losses to each shallow classifier: standard cross-entropy against the labels, KL divergence toward the softened output of the deepest classifier, and an L2 hint loss aligning its feature maps with those of the deepest section (see the loss sketch after this list).
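A hedged sketch of how these three terms could be combined is shown below. The weighting scheme and the defaults for alpha, lam, and temperature are assumptions for illustration, not values from the paper, and the feature maps are assumed to have already been projected to the deepest section's shape (e.g., by the bottleneck layers).

```python
import torch.nn.functional as F

def self_distillation_loss(logits, features, targets,
                           alpha=0.3, lam=0.03, temperature=3.0):
    """Combine the three loss terms over all exits.

    `logits`  : list of [B, num_classes] tensors, shallowest to deepest.
    `features`: feature tensors already projected to the shape of the
                deepest feature map (e.g. by bottleneck layers).
    alpha, lam, and temperature are illustrative defaults, not the paper's.
    """
    teacher_logits = logits[-1].detach()   # deepest classifier acts as teacher
    teacher_feat = features[-1].detach()

    # The deepest exit is trained with plain cross-entropy only.
    loss = F.cross_entropy(logits[-1], targets)

    for student_logits, student_feat in zip(logits[:-1], features[:-1]):
        ce = F.cross_entropy(student_logits, targets)          # hard labels
        kl = F.kl_div(                                          # soft teacher
            F.log_softmax(student_logits / temperature, dim=1),
            F.softmax(teacher_logits / temperature, dim=1),
            reduction="batchmean",
        ) * temperature ** 2
        hint = F.mse_loss(student_feat, teacher_feat)           # feature hint
        loss = loss + (1 - alpha) * ce + alpha * kl + lam * hint
    return loss
```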
Adaptive Inference: Because every section carries its own classifier, the network can be deployed at variable depths, catering to limited-resource environments: shallower exits give faster but less accurate predictions, while the deepest exit offers the highest accuracy at the cost of longer inference time (an illustrative exit-selection sketch follows).
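The snippet below illustrates one way such depth-wise scalable inference could look, reusing the section/head layout of the earlier sketch. Running a fixed number of sections mirrors the paper's accuracy-efficiency trade-off; the confidence-threshold rule is a common early-exit heuristic added here for illustration, not a procedure from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def adaptive_predict(model, x, max_exit=None, threshold=0.9):
    """Run backbone sections one at a time and stop early.

    Stops after `max_exit` sections (a fixed accuracy/latency trade-off) or
    at the first exit whose least-confident sample exceeds `threshold`.
    Assumes the SelfDistillationNet layout sketched earlier
    (model.sections / model.heads).
    """
    num_sections = len(model.sections)
    for i, (section, head) in enumerate(zip(model.sections, model.heads)):
        x = section(x)
        logits = head(x)
        conf, pred = F.softmax(logits, dim=1).max(dim=1)
        reached_budget = max_exit is not None and i + 1 >= max_exit
        if reached_budget or conf.min() >= threshold or i == num_sections - 1:
            return pred
```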
Comparative Assessments
The self distillation method shows several advantages over existing techniques:
- Comparison with Traditional Distillation: The method obviates the necessity for pre-trained large teacher models, reducing both the training time and the overall computational cost.
- Comparison with Deeply Supervised Networks: Adding the distillation losses to the shallow classifiers yields a larger performance boost than the plain cross-entropy supervision of deeply supervised networks, which the authors attribute to better gradient flow and more discriminative feature learning.
Theoretical Implications
Flat Minima and Generalization: The paper argues that self distillation helps the optimizer find flatter minima, which are associated with better generalization. This claim is supported by experiments showing greater robustness when the trained weights are perturbed with Gaussian noise (a simple test harness is sketched below).
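A simple way to probe this is to add Gaussian noise of increasing standard deviation to a trained model's weights and track the accuracy drop; a flatter minimum should degrade more gracefully. The sketch below is an illustrative test harness built on the multi-exit wrapper from earlier, not the authors' exact protocol.

```python
import copy
import torch

@torch.no_grad()
def accuracy_under_noise(model, loader, sigma, device="cpu"):
    """Accuracy after adding zero-mean Gaussian noise (std = sigma) to every
    parameter. Predictions come from the deepest exit of the multi-exit
    sketch above; a flatter minimum should lose less accuracy as sigma grows.
    """
    noisy = copy.deepcopy(model).to(device).eval()
    for p in noisy.parameters():
        p.add_(torch.randn_like(p) * sigma)

    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        logits_list, _ = noisy(images)
        correct += (logits_list[-1].argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```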
Discriminative Features and Gradient Flow: The mean-magnitude-of-gradient analysis shows that self distillation mitigates the vanishing gradient problem and maintains gradient flow throughout the network, and PCA visualization indicates that the deeper sections learn more discriminative features (a rough proxy for the gradient analysis is sketched after this paragraph).
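As a rough proxy for that analysis, one can average the absolute gradients of each parameter tensor after a backward pass and compare shallow versus deep layers; the helper below is an assumed utility, not code from the paper.

```python
def mean_gradient_magnitude(model, loss):
    """Average absolute gradient per named parameter after one backward pass,
    a rough proxy for comparing gradient flow in shallow vs. deep layers."""
    model.zero_grad()
    loss.backward()
    return {
        name: p.grad.abs().mean().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }
```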
Future Directions
Despite its demonstrated benefits, several areas remain open for further research:
- Hyperparameter Optimization: Automatic tuning of the introduced loss weights α (on the KL distillation term) and λ (on the feature-alignment hint term) could yield further improvements.
- Training Dynamics: Exploring alternating training regimes between self distillation and conventional deep model training could optimize the final convergence stages and possibly enhance performance.
By eliminating the requirement for large pre-trained models and facilitating adaptive, computationally efficient inference, the self distillation framework presented by Zhang et al. offers a promising advance in the training and deployment of convolutional neural networks. The principles outlined here not only suggest practical utility for resource-limited devices but also provide intriguing directions for the theoretical understanding of knowledge transfer within neural networks.