- The paper introduces Similarity-Preserving Knowledge Distillation (SPKD), a method that trains the student to preserve the teacher's pairwise activation similarities rather than to replicate its full representations.
- It employs a Frobenius norm-based loss to align teacher and student similarity matrices, demonstrating improved performance on CIFAR-10, CINIC-10, and a texture transfer-learning benchmark (DTD).
- Experimental results show relative error reductions of roughly 7-14% over conventional training and improved transfer learning, supporting both the practical and the theoretical value of the approach.
Similarity-Preserving Knowledge Distillation: An Academic Summary
Knowledge distillation is a widely used technique for training a "student" neural network with the knowledge captured by a "teacher" network, with applications such as model compression and privileged learning. This paper introduces a knowledge distillation approach termed Similarity-Preserving Knowledge Distillation (SPKD). Rather than having the student mimic the teacher's representation space directly, the method trains the student to preserve the teacher's patterns of activation similarity across inputs.
Conceptual Framework
The central observation behind SPKD is that semantically similar inputs tend to elicit similar activation patterns in a trained neural network, a structure that can be captured in matrices of pairwise similarities computed over a mini-batch. Rather than requiring the student to replicate the teacher's representation space, SPKD trains the student so that pairs of inputs that produce similar (or dissimilar) activations in the teacher also produce similar (or dissimilar) activations in the student.
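Concretely, for a mini-batch of $b$ inputs and a chosen layer, the construction can be written as follows (notation reconstructed here for illustration: $A$ is the activation map, $Q$ its flattened form, and $G$ the row-normalized similarity matrix):

$$
A \in \mathbb{R}^{b \times c \times h \times w} \;\longrightarrow\; Q \in \mathbb{R}^{b \times chw}, \qquad \tilde{G} = Q\,Q^{\top}, \qquad G_{[i,:]} = \tilde{G}_{[i,:]} \,\big/\, \bigl\lVert \tilde{G}_{[i,:]} \bigr\rVert_2 .
$$

Entry $G_{ij}$ encodes how similarly inputs $i$ and $j$ activate that layer; the student is trained so that its own $G$ matches the teacher's.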
Methodology
To illustrate SPKD, consider the following:
- Similarity Matrix Construction: Given a mini-batch of input images, each network computes activation maps at selected layers. Each map is reshaped into a batch-by-features matrix, the outer product of that matrix with itself yields a batch-by-batch similarity matrix, and the rows are L2-normalized.
- Distillation Loss: The loss penalizes the squared Frobenius norm of the difference between the student's and the teacher's similarity matrices, averaged over the mini-batch.
- Training Mechanism: The total objective combines this similarity-preserving loss, scaled by a weighting hyperparameter, with the conventional cross-entropy loss used for supervised classification (a code sketch follows this list).
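A minimal PyTorch-style sketch of this objective, assuming a single teacher-student layer pair; the function names, the detach on the teacher activations, and the default value of the weight gamma are illustrative choices rather than the authors' released code:

```python
import torch
import torch.nn.functional as F

def similarity_matrix(acts: torch.Tensor) -> torch.Tensor:
    """Row-normalized pairwise similarity matrix for a mini-batch.

    acts: (b, c, h, w) activation map from one layer.
    Returns a (b, b) matrix whose rows are L2-normalized.
    """
    b = acts.size(0)
    q = acts.reshape(b, -1)            # flatten to (b, c*h*w)
    g = q @ q.t()                      # (b, b) pairwise dot products
    return F.normalize(g, p=2, dim=1)  # row-wise L2 normalization

def sp_loss(teacher_acts: torch.Tensor, student_acts: torch.Tensor) -> torch.Tensor:
    """Similarity-preserving loss: squared Frobenius distance between
    the teacher's and student's similarity matrices, scaled by 1/b^2."""
    b = teacher_acts.size(0)
    g_t = similarity_matrix(teacher_acts)
    g_s = similarity_matrix(student_acts)
    return torch.norm(g_t - g_s, p="fro") ** 2 / (b * b)

def total_loss(student_logits, labels, teacher_acts, student_acts, gamma=3000.0):
    """Cross-entropy plus the similarity-preserving term, weighted by gamma."""
    ce = F.cross_entropy(student_logits, labels)
    # The teacher is frozen, so its activations contribute no gradient.
    return ce + gamma * sp_loss(teacher_acts.detach(), student_acts)
```

When the loss is applied at several corresponding layer pairs, the individual sp_loss terms are summed before weighting; the sketch uses a single pair for brevity.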
Experimental Validation
Three sets of experiments underscore the merits of SPKD:
- CIFAR-10 Dataset: On CIFAR-10, student networks trained with SPKD consistently outperformed the same networks trained conventionally. This held across various combinations of teacher and student architectures, confirming the approach's flexibility.
- Transfer Learning: When applied to transfer learning on the Describable Textures Dataset (DTD), SPKD improved upon standard fine-tuning, indicating that it helps transfer learned representations effectively across visual domains.
- CINIC-10 Dataset: Experiments on the CINIC-10 dataset validated the robust performance of SPKD, revealing that it complements other methods like Attention Transfer (AT) and can be combined for enhanced performance.
Results and Insights
The experimental results offer substantial quantitative evidence:
- On CIFAR-10, student networks trained with SPKD achieved relative error reductions of roughly 7-14% compared to conventional training.
- In the transfer learning setting, combining SPKD with fine-tuning reduced error by roughly one percentage point in absolute terms.
- The validation on CINIC-10 highlighted the complementary nature of SPKD, lowering error more effectively when combined with other distillation techniques.
Implications and Future Directions
The proposed SPKD method marks a notable shift in distillation methodology. Because the similarity matrices are batch-by-batch rather than feature-by-feature, the student is not tied to the dimensions of the teacher's representation space, which yields both practical and theoretical advantages:
- Practical Implications: SPKD improves the training of compact models without adding inference-time cost, underscoring its utility in real-world settings such as mobile and embedded systems.
- Theoretical Insights: The focus on activation similarity relationships paves the way for future research in representation learning and knowledge transfer techniques.
Potential future research trajectories include the adoption of SPKD in semi-supervised learning, leveraging unlabelled data to distill knowledge further. This promises to expand SPKD's applications in scenarios where annotated data is scarce but auxiliary data is plentiful.
Conclusion
The SPKD methodology proposed in this paper shifts the emphasis of knowledge distillation from matching representations to preserving activation similarities. Experiments across multiple datasets demonstrate its benefits. The paper's contributions provide a foundation both for continued research on neural network distillation and for the practical deployment of resource-efficient deep learning models.