- The paper introduces a novel feature distillation methodology that rethinks the loss design around a margin ReLU teacher transform and a partial L2 distance function.
- It demonstrates significant improvements, achieving a 21.65% top-1 error rate on ImageNet with a ResNet50 student, surpassing its ResNet152 teacher.
- The approach enhances model compression and flexibility, making it effective for tasks like object detection and semantic segmentation.
A Comprehensive Overhaul of Feature Distillation
The paper presents a reworked approach to feature distillation aimed at improving both the performance and the compression of neural networks. The authors derive their method by analyzing the key design choices in feature distillation: the teacher transform, the student transform, the position of the distilled features, and the distance function. The resulting formulation improves the transfer of knowledge from teacher models to student models and yields notable gains across several computer vision tasks.
Distillation Loss Design
Central to the proposed method is a reformulation of the distillation loss: the teacher's features are transformed with a newly designed margin ReLU, the features are distilled at a deliberately chosen position (before the activation, as discussed below), and the discrepancy is measured with a partial L2 distance function. Together, these choices balance transferring the teacher's informative responses against excluding adverse ones that would hinder the compressed student network.
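To make this concrete, one plausible formalization consistent with the description above is the following; the symbols $F_t$ and $F_s$ (teacher and student feature maps), $\sigma_m$ (margin ReLU with a negative margin $m$), $r$ (the student's regressor), $d_p$ (partial L2 distance), and the weight $\alpha$ are notation introduced here for clarity, not taken verbatim from the paper:

$$
\mathcal{L} = \mathcal{L}_{\text{task}} + \alpha \, d_p\!\left(\sigma_m(F_t),\, r(F_s)\right),
\qquad
\sigma_m(x) = \max(x, m), \quad m < 0,
$$

$$
d_p(T, S) = \sum_i
\begin{cases}
0 & \text{if } S_i \le T_i \le 0,\\
(T_i - S_i)^2 & \text{otherwise.}
\end{cases}
$$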
Experimental Results
The paper demonstrates strong empirical results, most notably on the ImageNet dataset, where the proposed method achieves a top-1 error rate of 21.65% with ResNet50, surpassing the performance of the teacher network ResNet152. Additionally, performance improvements are consistently observed across other tasks such as object detection and semantic segmentation.
Design Aspects and Their Impact
The method's design is dissected across four dimensions:
- Teacher Transform: A margin ReLU passes positive teacher responses through unchanged while capping negative ones at a negative, channel-wise margin. This reduces the adverse information that the teacher would otherwise force onto the student.
- Student Transform: Instead of reducing the teacher's features, which loses information, an asymmetric design places a 1x1 convolutional layer on the student side to map its features to the teacher's channel dimension.
- Distillation Feature Position: Features are distilled at the pre-ReLU position, where both positive and negative responses are still available; methods that distill after the activation discard part of this information.
- Distance Function: A partial L2 distance zeroes the penalty wherever the teacher's response is non-positive and the student's response is already below it, so that only informative responses are transferred (the sketch after this list implements these choices together).
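Below is a minimal PyTorch-style sketch of how these pieces might fit together, assuming a precomputed channel-wise margin and a simple batch-size normalization; the class name and these details are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class FeatureDistillationLoss(nn.Module):
    """Sketch of a feature-distillation loss combining a margin-ReLU teacher
    transform, a 1x1-conv student regressor, and a partial L2 distance."""

    def __init__(self, student_channels: int, teacher_channels: int, margin: torch.Tensor):
        super().__init__()
        # Student transform: a 1x1 convolution (plus BatchNorm) maps the
        # student's channels to the teacher's, so the teacher features are
        # never reduced and no information is lost on the teacher side.
        self.regressor = nn.Sequential(
            nn.Conv2d(student_channels, teacher_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(teacher_channels),
        )
        # Negative channel-wise margin (a 1-D tensor of length C), e.g. an
        # estimate of the expected negative pre-ReLU response per channel.
        self.register_buffer("margin", margin.view(1, -1, 1, 1))

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # Teacher transform: margin ReLU, max(x, m) with m < 0. Positive
        # responses pass through; negative ones are capped at the margin.
        teacher_t = torch.max(teacher_feat, self.margin)
        student_t = self.regressor(student_feat)
        # Partial L2 distance: no penalty where the teacher target is
        # non-positive and the student response is already below it.
        sq_diff = (teacher_t - student_t) ** 2
        skip = (student_t <= teacher_t) & (teacher_t <= 0)
        return torch.where(skip, torch.zeros_like(sq_diff), sq_diff).sum() / student_feat.size(0)
```

In use, a term such as `alpha * loss(student_feat, teacher_feat)` would be added to the ordinary task loss, with both feature maps taken at the pre-ReLU positions of corresponding stages; `alpha` is a weighting hyperparameter assumed here, not specified by the summary above.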
Practical and Theoretical Implications
The proposed feature distillation framework not only improves model compression but also offers greater architectural flexibility, allowing distillation between structurally different teacher and student networks. This has significant implications for future developments in AI where model size and efficiency are critical, such as deploying machine learning models on resource-constrained devices.
Speculation on Future Developments
The approach lays the groundwork for further research into efficient network training and adaptation. Natural next steps include exploring more sophisticated teacher-student configurations and integrating these strategies with other compression techniques, such as pruning or quantization, for compounded benefits.
Conclusion
This paper contributes a significant advance in feature distillation, marked by a holistic redesign of the loss and substantial empirical validation. The proposed methodology not only outperforms existing methods but also establishes a robust framework that adapts to a variety of architectures and tasks. Such an approach can be pivotal in the ongoing effort to optimize neural networks for both performance and efficiency.