Contrastive Representation Distillation (1910.10699v3)

Published 23 Oct 2019 in cs.LG, cs.CV, and stat.ML

Abstract: Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation. Code: http://github.com/HobbitLong/RepDistiller.

Authors (3)
  1. Yonglong Tian (32 papers)
  2. Dilip Krishnan (36 papers)
  3. Phillip Isola (84 papers)
Citations (977)

Summary

  • The paper introduces a novel contrastive learning-based distillation objective that captures structured, high-order dependencies in neural representations.
  • It demonstrates significant performance gains in tasks including model compression, cross-modal transfer, and ensemble distillation over traditional KD methods.
  • Experimental results highlight improvements such as a 57% relative gain on CIFAR-100 and notable accuracy boosts on ImageNet and multi-modal tasks.

Contrastive Representation Distillation

In the paper "Contrastive Representation Distillation", the authors Yonglong Tian, Dilip Krishnan, and Phillip Isola present a novel framework for transferring representational knowledge between deep neural networks. The central premise is that the standard approach to knowledge distillation, which typically minimizes the KL divergence between the probabilistic outputs of a teacher and student network, is insufficient for capturing the structured knowledge inherent in the teacher's representation. To address this limitation, the authors propose using contrastive learning to distill more comprehensive information from the teacher network.
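
For reference, here is a minimal sketch of the standard knowledge-distillation objective the paper argues against as a sole training signal. It assumes PyTorch; the temperature value is illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard knowledge distillation (Hinton et al.): KL divergence between
    temperature-softened teacher and student output distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable to the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```

Because this loss factorizes over output dimensions, it treats each class probability independently and does not directly constrain the structure of the student's internal representation, which is the gap CRD targets.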

Key Contributions

Contrastive-Based Distillation Objective

The authors present a new distillation objective that leverages contrastive learning to capture correlations and higher-order dependencies within the representation space. This objective is inspired by recent advances in self-supervised learning. Specifically, the objective maximizes a lower bound on the mutual information between the teacher and student representations, ensuring that the student captures more of the teacher's structural information than a per-sample KL objective would.
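
The paper's actual CRD implementation uses an NCE-based critic with a large memory bank of negative samples. The sketch below is a simplified, hedged illustration using in-batch negatives with an InfoNCE-style loss; the projection dimensions, temperature, and class name are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveDistillLoss(nn.Module):
    """Simplified contrastive distillation sketch (not the paper's exact CRD
    objective). Positive pairs are the teacher and student embeddings of the
    same input; negatives are the other samples in the batch."""

    def __init__(self, dim_s, dim_t, feat_dim=128, temperature=0.07):
        super().__init__()
        # Linear heads project both networks into a shared embedding space.
        self.proj_s = nn.Linear(dim_s, feat_dim)
        self.proj_t = nn.Linear(dim_t, feat_dim)
        self.temperature = temperature

    def forward(self, feat_s, feat_t):
        z_s = F.normalize(self.proj_s(feat_s), dim=1)            # (B, d)
        z_t = F.normalize(self.proj_t(feat_t.detach()), dim=1)   # teacher is frozen
        logits = z_s @ z_t.t() / self.temperature                # (B, B) similarities
        labels = torch.arange(z_s.size(0), device=z_s.device)    # positives on the diagonal
        return F.cross_entropy(logits, labels)
```

Minimizing this cross-entropy over matched versus mismatched teacher-student pairs tightens a lower bound on the mutual information between the two representations, which is the connection the paper formalizes.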

Superior Performance Across Tasks

The proposed Contrastive Representation Distillation (CRD) method is evaluated across various knowledge transfer tasks, demonstrating its superiority over existing methods. The tasks include:

  1. Model Compression: CRD effectively compresses large networks into smaller ones without significant loss in performance (a sketch of a combined student-training objective follows this list).
  2. Cross-Modal Transfer: CRD transfers knowledge between different sensory modalities, improving performance in tasks like transferring knowledge from image to sound or depth.
  3. Ensemble Distillation: CRD distills ensembles of models into single estimators more effectively than other methods.
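
In the model-compression setting, the contrastive term is typically used alongside the usual classification and distillation losses. The sketch below combines them, reusing the hypothetical kd_loss and ContrastiveDistillLoss sketches above; the loss weights are illustrative assumptions, not the paper's tuned values.

```python
import torch.nn.functional as F

def student_training_loss(student_logits, teacher_logits,
                          feat_s, feat_t, labels, crd_criterion,
                          alpha=1.0, beta=0.8):
    """Illustrative combined objective for compressing a teacher into a student:
    cross-entropy on ground-truth labels + KD on softened logits + a contrastive
    term on penultimate-layer features. kd_loss and crd_criterion refer to the
    sketches above; alpha and beta are assumed weights."""
    ce = F.cross_entropy(student_logits, labels)
    kd = kd_loss(student_logits, teacher_logits)
    crd = crd_criterion(feat_s, feat_t)
    return ce + alpha * kd + beta * crd
```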

Experimental Results

The authors benchmark CRD against 12 recent distillation methods, showing substantial improvements:

  1. CIFAR-100 Dataset: CRD outperforms the original KD by an average relative improvement of 57%. For instance, CRD achieves a test accuracy of 75.48% when distilling a WRN-40-2 teacher into a WRN-16-2 student, compared to 74.92% for KD.
  2. ImageNet: On ImageNet, using ResNet-34 as the teacher and ResNet-18 as the student, CRD improves top-1 accuracy by 1.42% over the student trained from scratch, and also outperforms the AT and KD baselines.
  3. Cross-Modal Transfer: In tasks transferring from luminance to chrominance or from RGB to depth, CRD consistently yields higher accuracy. For example, CRD achieves a pixel accuracy of 61.6% in the RGB to depth transfer task on the NYU-Depth dataset.
  4. Ensemble Distillation: CRD reduces the error rate of a WRN-16-2 student to 23.7% from an ensemble of 8 teachers on CIFAR-100, significantly outperforming KD.

Implications and Future Directions

Practical Implications

CRD's ability to distill rich, structured knowledge makes it particularly suitable for applications where model simplicity is crucial, such as deployment on edge devices. The improved performance in cross-modal transfer tasks can facilitate better multi-modal learning systems, enhancing applications in autonomous systems and multi-sensor fusion.

Theoretical Implications

By connecting the fields of knowledge distillation and representation learning, this paper paves the way for new research directions. The mutual information maximization perspective offers a robust theoretical foundation for understanding and improving distillation methods.
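
As a hedged illustration of this perspective (the paper derives a closely related NCE-based bound rather than exactly this one), the widely used InfoNCE bound makes the connection explicit:

```latex
% Lower bound on the mutual information between the teacher representation T
% and the student representation S, given a learned critic h, one positive
% pair (t, s), and N-1 negatives s_j drawn from other inputs:
\[
  I(T; S) \;\geq\; \log N \;+\;
  \mathbb{E}\!\left[ \log \frac{h(t, s)}{\sum_{j=1}^{N} h(t, s_j)} \right]
\]
% Training the student and critic to maximize the right-hand side therefore
% pushes up a lower bound on the teacher-student mutual information.
```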

Speculative Future Developments

Looking ahead, CRD could be combined with other advanced representation learning techniques to explore more efficient network architectures. Moreover, adapting CRD to unsupervised settings or combining it with reinforcement learning could further extend its versatility.

Conclusion

Contrastive Representation Distillation presents a compelling advancement in the field of neural network distillation. Its ability to capture and transfer structured representational knowledge more effectively than existing methods marks a significant step forward. The robust experimental results across diverse tasks highlight CRD's practical relevance and underscore its potential to influence future research in both knowledge distillation and representation learning.
