
Simple Unsupervised Knowledge Distillation With Space Similarity (2409.13939v1)

Published 20 Sep 2024 in cs.AI and cs.CV

Abstract: As per recent studies, Self-supervised learning (SSL) does not readily extend to smaller architectures. One direction to mitigate this shortcoming while simultaneously training a smaller network without labels is to adopt unsupervised knowledge distillation (UKD). Existing UKD approaches handcraft preservation worthy inter/intra sample relationships between the teacher and its student. However, this may overlook/ignore other key relationships present in the mapping of a teacher. In this paper, instead of heuristically constructing preservation worthy relationships between samples, we directly motivate the student to model the teacher's embedding manifold. If the mapped manifold is similar, all inter/intra sample relationships are indirectly conserved. We first demonstrate that prior methods cannot preserve teacher's latent manifold due to their sole reliance on $L_2$ normalised embedding features. Subsequently, we propose a simple objective to capture the lost information due to normalisation. Our proposed loss component, termed \textbf{space similarity}, motivates each dimension of a student's feature space to be similar to the corresponding dimension of its teacher. We perform extensive experiments demonstrating strong performance of our proposed approach on various benchmarks.

Summary

  • The paper introduces a novel distillation method that directly aligns the student’s embedding manifold with the teacher’s using space similarity.
  • It combines cosine and space similarity losses to preserve inter/intra-sample relationships, achieving up to a 15% improvement in top-1 accuracy on ImageNet benchmarks.
  • Empirical results show enhanced transfer learning, dense predictions, and image retrieval performance, demonstrating the method’s robustness across various tasks.

Simple Unsupervised Knowledge Distillation With Space Similarity

Introduction

In recent literature, the challenge of distilling high-capacity networks trained via self-supervised learning (SSL) into smaller architectures has garnered substantial interest. Self-supervised learning has proven effective for training models on vast unlabelled datasets, leading to robust, task-agnostic representations. However, smaller networks, often used in real-time applications such as autonomous driving and industrial automation, struggle to harness the benefits of SSL due to their limited parameter capacity.

Related Work

Prior approaches like SEED initiated the direction of unsupervised knowledge distillation (UKD), enabling smaller networks to benefit from SSL pre-trained large models. Extending SEED, several methods such as BINGO, DisCo, and SMD manually construct and preserve similarity relationships among samples, often using embedding queues. Nonetheless, these methods either fail to perfectly model the underlying manifold of the teacher’s embeddings or rely excessively on $L_2$ normalized features, which, due to their non-invertible nature, lose significant information and cannot preserve the original manifold structure.

Methodology

The paper introduces a novel approach that bypasses the manual construction of relationships and directly aligns the student’s embedding manifold with that of the teacher. If the manifolds are similar, all inter/intra-sample relationships are conserved indirectly. Specifically, the authors propose a space similarity objective that encourages each dimension of the student's feature space to be similar to the corresponding dimension of the teacher’s.

The proposed approach includes two main components:

  1. Cosine Similarity Loss ($L_{co}$): This component maximizes the pair-wise similarity of the $L_2$-normalized embeddings of corresponding teacher and student samples.
  2. Space Similarity Loss ($L_{ss}$): This new objective preserves similarity along each embedding dimension by normalizing and aligning corresponding rows of the transposed feature matrices of the teacher and student (i.e., each feature dimension taken across the batch); both terms are formalized in the display below.
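
Based on this description, one plausible formalization of the two terms (an illustrative reading, not equations quoted from the paper) is, for a batch of $B$ samples with $D$-dimensional features, where $\hat{a}^{\,i}$ denotes the $L_2$-normalized embedding of sample $i$ and $\hat{z}^{\,d}$ the $L_2$-normalized $d$-th feature dimension taken across the batch:

$$
L_{co} = -\frac{1}{B}\sum_{i=1}^{B} \hat{a}_s^{\,i} \cdot \hat{a}_t^{\,i},
\qquad
L_{ss} = -\frac{1}{D}\sum_{d=1}^{D} \hat{z}_s^{\,d} \cdot \hat{z}_t^{\,d}.
$$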

By combining these components ($L_{CoSS}$), the method ensures that the student model learns to mimic the overall structure and mapping of the teacher’s manifold, rather than isolated similarities.
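
A minimal PyTorch-style sketch of the combined objective is given below. It assumes the student's features have already been projected to the teacher's dimensionality and uses a hypothetical weighting factor `lambda_ss` for the space similarity term; neither detail is specified in this summary.

```python
import torch
import torch.nn.functional as F


def coss_loss(student_feats: torch.Tensor,
              teacher_feats: torch.Tensor,
              lambda_ss: float = 1.0) -> torch.Tensor:
    """Sketch of a combined cosine + space similarity loss.

    Both inputs are (batch, dim) feature matrices. `lambda_ss` is an
    assumed weighting term, not taken from the paper.
    """
    # L_co: L2-normalise each sample's embedding and maximise its cosine
    # similarity with the corresponding teacher embedding.
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    l_co = -(s * t).sum(dim=1).mean()

    # L_ss: transpose so each row holds one embedding dimension across the
    # batch, normalise those rows, and align corresponding dimensions.
    s_sp = F.normalize(student_feats.t(), dim=1)  # (dim, batch)
    t_sp = F.normalize(teacher_feats.t(), dim=1)  # (dim, batch)
    l_ss = -(s_sp * t_sp).sum(dim=1).mean()

    return l_co + lambda_ss * l_ss


# Example usage with random features (the teacher is frozen during distillation):
if __name__ == "__main__":
    student = torch.randn(256, 512, requires_grad=True)
    teacher = torch.randn(256, 512)
    loss = coss_loss(student, teacher.detach())
    loss.backward()
```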

Results

The experimental validation spans various benchmarks, providing strong empirical evidence of the effectiveness and robustness of the proposed CoSS method.

  1. ImageNet Distillation:
    • On distilling a MoCo-v2 ResNet-50 teacher into smaller architectures such as ResNet-18, ResNet-34, and EfficientNet-B0, the proposed method consistently outperforms state-of-the-art techniques like SEED and BINGO. Notably, CoSS achieves top-1 accuracy improvements of approximately 10%, 7%, and 15%, respectively, for these student models over baselines trained directly with MoCo-v2.
  2. Transfer Learning:
    • CoSS-trained students exhibit superior transfer learning accuracies across a wide array of datasets such as CIFAR-10, CIFAR-100, STL-10, and others. The ResNet-18 student distilled using CoSS consistently outperforms other methods on multiple datasets, demonstrating the generalizability and robustness of the learned representations.
  3. Dense Predictions:
    • For dense prediction tasks like object detection and segmentation on VOC and COCO datasets, the ResNet-18 CoSS student competes closely with PCD, a specialized method for dense predictions, and even surpasses it on COCO.
  4. Image Retrieval:
    • CoSS demonstrates improved performance in image retrieval benchmarks such as Oxford-5k and FORB, indicating the robustness of the learned embeddings in matching and retrieval tasks.
  5. Out-of-Distribution Robustness:
    • The distilled models show competitive performance on datasets like ImageNet-v2, ImageNet-Sketch, and ImageNet-C, highlighting their robustness against distribution shifts and various corruptions.

Implications and Future Work

The proposed approach marks a significant advance in unsupervised knowledge distillation by directly modeling the teacher’s embedding manifold. The method could potentially be integrated with existing frameworks, enhancing their applicability and performance in various scenarios. While demonstrated primarily in the domain of computer vision, these principles may extend to natural language processing, particularly with the rise of unsupervised large models.

Future directions involve exploring stronger constraints beyond homeomorphism to better capture the topological fidelity of the teacher’s embeddings and extending the current methods to other modalities and multi-task settings.

Conclusion

This paper presents a strong case for direct manifold modeling in unsupervised knowledge distillation. By leveraging space similarity in conjunction with traditional cosine similarity, the approach yields significant improvements across various benchmarks, offering an efficient, robust technique for distilling the essence of large, self-supervised models into smaller, more deployable architectures.
