
Born Again Neural Networks

Published 12 May 2018 in stat.ML, cs.AI, and cs.LG | (1805.04770v2)

Abstract: Knowledge Distillation (KD) consists of transferring “knowledge” from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student’s compactness, without sacrificing too much performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.

Citations (982)

Summary

  • The paper introduces novel knowledge distillation objectives (CWTM and DKPP) that enable student models to exceed their teachers’ performance.
  • Experiments on CIFAR datasets show significant error rate reductions, with students achieving much lower validation errors than identical teacher models.
  • The study highlights sequential training and cross-architecture generalization, suggesting broad applications from computer vision to language modeling.

Born-Again Neural Networks: A Summary

The paper "Born-Again Neural Networks" (BANs) contributes a novel perspective on Knowledge Distillation (KD), in which student models distilled from their teachers exceed the teachers' performance. Traditionally, KD transfers knowledge from a large, well-performing model to a more compact one, trading a small drop in accuracy for efficiency. Contrary to this conventional approach, this study explores scenarios where the student's architecture and parameter count are identical to the teacher's.
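The standard KD setup trains the student to match the teacher's softened output distribution. The sketch below is a minimal NumPy illustration of that soft-target cross-entropy, not the authors' implementation; the temperature value and the epsilon inside the log are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's and student's soft outputs."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -np.mean(np.sum(p_t * np.log(p_s + 1e-12), axis=-1))
```

The loss is minimized when the student's distribution matches the teacher's; in BANs this same objective is applied even though both networks have identical capacity.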

Summary of Key Findings

The surprising discovery of BANs is that they consistently outperform their teachers across tasks ranging from computer vision to language modeling. Experiments on the CIFAR datasets using DenseNet architectures demonstrated state-of-the-art validation errors: 3.5% on CIFAR-10 and 15.5% on CIFAR-100.

Distillation Methods

The paper introduces two new distillation objectives:

  1. Confidence-Weighted by Teacher Max (CWTM): The student is trained with a loss in which each sample's contribution is weighted by the teacher's maximum predicted confidence for that sample.
  2. Dark Knowledge with Permuted Predictions (DKPP): This method permutes all but the argmax outputs from the teacher, isolating the effect of the non-argmax outputs on the KD process.
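The two objectives above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's code: CWTM is shown with the teacher's per-sample max confidences normalized over the batch, and DKPP permutes the non-argmax entries within each sample while leaving the argmax in place.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cwtm_loss(student_logits, teacher_logits, labels):
    """Per-sample cross-entropy against the true labels, weighted by
    the teacher's max confidence (normalized across the batch)."""
    w = softmax(teacher_logits).max(axis=-1)
    w = w / w.sum()
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels] + 1e-12)
    return np.sum(w * ce)

def dkpp_targets(teacher_logits):
    """Permute each sample's non-argmax teacher outputs, keeping the
    argmax entry fixed, to scramble the 'dark knowledge' structure."""
    out = teacher_logits.copy()
    for row in out:
        i = row.argmax()
        rest = np.delete(np.arange(row.size), i)
        row[rest] = row[rng.permutation(rest)]
    return out
```

Training on `dkpp_targets(teacher_logits)` with the usual KD loss then tests whether the per-class identity of the non-argmax outputs, rather than their mere presence, drives the gains.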

Major Contributions and Results

  1. Parameter Efficiency: The study revealed that BANs can enhance performance while maintaining the same parameter count. For instance, distilling a DenseNet-90-60 teacher into an identical student reduced the error rate from 17.69% to 16.00% on CIFAR-100.
  2. Sequential Training and Ensembles: The concept of a sequence of teaching selves was explored, showing incremental improvements through multiple generations of training. For example, DenseNet-80-80 models showed reduced error rates through sequential training steps and ensemble methods.
  3. Hypothesis Testing: By employing the CWTM and DKPP methods, the research provided insights into the mechanics of KD. The CWTM method demonstrated that even without dark knowledge, there were performance gains, corroborating the hypothesis that the confidence in a prediction influences learning efficiency.
  4. Model Generalization: Applying BANs across different architectures showed that the technique is robust to model design. Notably, ResNet students distilled from DenseNet teachers improved over ResNets trained from scratch, without significant performance trade-offs.
  5. Application to Language Models: Extending the BAN approach to sequence models, namely an LSTM on the Penn Treebank dataset, yielded notable reductions in validation and test perplexity, demonstrating the approach's versatility beyond image classification.
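The sequential-training idea in point 2 can be sketched as a simple loop: each generation distills from the previous one, and the generations can then be ensembled by averaging their softmax outputs. This is a structural sketch only; `fit_student` is a hypothetical training routine (taking data, labels, and optional teacher logits, and returning a logits function), not an API from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def born_again_sequence(fit_student, x, y, generations=3):
    """Train a chain of identically parameterized students: generation
    k+1 distills from the logits of generation k (the first generation
    trains from the labels alone, having no teacher)."""
    teacher_logits = None
    models = []
    for _ in range(generations):
        model = fit_student(x, y, teacher_logits)  # hypothetical trainer
        teacher_logits = model(x)  # this generation teaches the next
        models.append(model)
    return models

def ban_ensemble(logit_list):
    """BAN ensemble: average the softmax outputs of all generations."""
    return np.mean([softmax(z) for z in logit_list], axis=0)
```

The paper reports that gains from additional generations eventually saturate, while ensembling the generations' predictions gives a further boost.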

Implications and Future Directions

The implications of this research are manifold. Practically, the technique offers an efficient method to enhance model performance without increasing model complexity. Theoretically, this study challenges the assumption that high-capacity models are necessary for optimal performance and opens avenues for research into the dynamics of knowledge transfer and model generalization.

Future Developments: Further investigations might extend BAN techniques to other domains such as natural language processing and reinforcement learning. Moreover, probing the limits of sequential training and its implications for ensembling is an exciting direction, along with the development of new architectures that inherently incorporate BAN methodologies for improved robustness and interpretability.

Overall, the insights drawn from this research represent a significant contribution to the fields of deep learning and model optimization, providing a foundational basis for future explorations in enhancing neural network performance through knowledge distillation.
