An Analytical Exploration of Subclass Distillation
The paper "Subclass Distillation" provides an in-depth examination of how knowledge distillation can be enhanced by introducing subclasses within a neural network training framework. This paper is particularly pivotal for binary classification or other limited class tasks where conventional distillation often suffers from information scarcity. The authors propose a method where, instead of transferring the probabilities of the teacher's classes directly, the teacher network is trained to create and assign probabilities to subclasses, which the student network then seeks to emulate.
Core Contributions
The main contribution is subclass distillation itself: the conventional distillation process is augmented with artificially created subclasses. The teacher is explicitly trained to split each class into subclasses during supervised training, and the resulting subclass probabilities give the student more information to match than the class probabilities alone. The authors show that this approach is advantageous over both classical distillation and penultimate-layer distillation in tasks with few classes.
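To make the mechanics concrete, the following minimal PyTorch-style sketch shows one way such a subclass head can be wired: each class owns a fixed number of subclass logits, a single softmax is taken over all subclasses, and ordinary class probabilities for the supervised loss are recovered by summing subclass probabilities within each class. The module and argument names (SubclassHead, num_subclasses, feature_dim) are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubclassHead(nn.Module):
    """Illustrative teacher head: one logit per (class, subclass) pair."""
    def __init__(self, feature_dim, num_classes, num_subclasses):
        super().__init__()
        self.num_classes = num_classes
        self.num_subclasses = num_subclasses
        self.fc = nn.Linear(feature_dim, num_classes * num_subclasses)

    def forward(self, features):
        # Subclass logits, shaped [batch, num_classes, num_subclasses].
        logits = self.fc(features).view(-1, self.num_classes, self.num_subclasses)
        # One softmax over all subclasses jointly, then sum within each class
        # to recover ordinary class probabilities for the supervised loss.
        subclass_probs = F.softmax(logits.flatten(1), dim=1).view_as(logits)
        class_probs = subclass_probs.sum(dim=2)
        return logits, class_probs

def supervised_loss(class_probs, labels):
    # Standard cross-entropy on the aggregated class probabilities.
    return F.nll_loss(torch.log(class_probs + 1e-12), labels)
```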
Additionally, the work introduces an auxiliary loss that encourages the network to make use of a diverse set of subclasses. The loss acts as a regularizer: each individual prediction is pushed to be confident ("peaky"), while different examples are spread across different subclasses, so that the learned subclass structure as a whole remains diverse and semantically rich.
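One way to realize such a regularizer is sketched below. This is not necessarily the exact auxiliary loss from the paper; it is a common entropy-based formulation, assumed here purely for illustration, that keeps each example's subclass distribution peaky while pushing the batch as a whole to use many subclasses.

```python
import torch
import torch.nn.functional as F

def diversity_auxiliary_loss(subclass_logits):
    """Illustrative confident-but-diverse regularizer (an assumed stand-in for
    the paper's auxiliary loss): per-example entropy is minimized so each
    prediction stays peaky, while the entropy of the batch-averaged subclass
    distribution is maximized so that many subclasses get used."""
    # subclass_logits: [batch, num_classes, num_subclasses]
    probs = F.softmax(subclass_logits.flatten(1), dim=1)  # [batch, C * S]
    per_example_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1).mean()
    mean_probs = probs.mean(dim=0)  # average subclass usage over the batch
    batch_entropy = -(mean_probs * torch.log(mean_probs + 1e-12)).sum()
    # Minimizing this term lowers per-example entropy and raises batch entropy.
    return per_example_entropy - batch_entropy
```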
Methodology and Results
The approach is evaluated on a range of standard datasets: CIFAR-10, CelebA, and the Criteo click-prediction data. One particularly illustrative experiment uses a binary split of CIFAR-10 (CIFAR-2x5), in which the ten classes are grouped into two superclasses of five to create a binary task, as sketched after this paragraph. A ResNet-20 teacher trained with subclass outputs achieves better binary classification performance than one trained in the standard way, evidencing the efficacy of the subclass head. The findings also indicate that subclass distillation yields better subclass discovery and clustering, which translates into faster student training and higher final accuracy; it outperforms both conventional and penultimate-layer distillation, with notable accuracy gains.
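For readers who want to reproduce the setup, the snippet below sketches one plausible way to derive binary CIFAR-2x5 labels. The particular grouping (original labels 0-4 versus 5-9) is an assumption made for illustration rather than the paper's documented split.

```python
import torchvision

def make_cifar2x5(root="./data", train=True):
    """Relabel CIFAR-10 as a binary task by grouping the ten classes into two
    superclasses of five. The grouping used here (labels 0-4 vs. 5-9) is an
    illustrative assumption, not necessarily the split used in the paper."""
    dataset = torchvision.datasets.CIFAR10(root=root, train=train, download=True)
    dataset.targets = [0 if label < 5 else 1 for label in dataset.targets]
    return dataset
```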
Further experiments on CelebA show that the subclass structure learned by the teacher correlates closely with known semantic attributes, which lends the method a degree of interpretability. On the Criteo dataset, a more challenging real-world application, subclass distillation both accelerates student training and improves generalization when data availability is constrained, an important property for large-scale applications.
Theoretical and Practical Implications
The theoretical case for subclass distillation rests on the premise that subclass probabilities retain more nuanced information than class probabilities alone. The auxiliary loss distributes examples across subclasses, effectively performing an unsupervised clustering within each class; this segments the data and enriches the distillation signal through finer-grained outputs.
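A minimal sketch of the corresponding transfer term, under the assumption that the student matches the teacher's temperature-scaled distribution over all subclasses rather than over the aggregated classes, might look as follows; the temperature value is illustrative.

```python
import torch.nn.functional as F

def subclass_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-scaled teacher and student subclass
    distributions: the student matches the teacher over every subclass, not
    just over the summed class probabilities. Temperature is an assumed value."""
    # Both logit tensors: [batch, num_classes, num_subclasses]
    teacher_probs = F.softmax(teacher_logits.flatten(1) / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits.flatten(1) / temperature, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures,
    # following standard distillation practice.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```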
In practical terms, subclass distillation can benefit applications constrained by computational resources, such as mobile and embedded systems, where models must be both small and accurate. It offers a way to deploy compact models in resource-constrained environments while retaining much of the teacher's accuracy and generalization.
Future Developments
Subclass distillation opens avenues for both theoretical research and applications. Future work may refine algorithmic efficiency, explore criteria for how many subclasses to create and how they are formed, and assess how subclass distillation interacts with other model-compression techniques such as quantization and pruning. Broader application to multimodal datasets could also further neural-network interpretability and annotation efficiency.
In summary, the paper makes a compelling case for subclass distillation as a natural evolution of the distillation paradigm, well suited to few-class scenarios because it exploits the richness of learned subclass relationships. The approach outlines a clear path toward efficient student training and sets a precedent for further exploration of neural-network compression techniques.