
Channel-wise Knowledge Distillation for Dense Prediction (2011.13256v4)

Published 26 Nov 2020 in cs.CV

Abstract: Knowledge distillation (KD) has been proven to be a simple and effective tool for training compact models. Almost all KD variants for dense prediction tasks align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancy. Observing that in semantic segmentation, some layers' feature activations of each channel tend to encode saliency of scene categories (analogous to class activation mapping), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a probability map using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence of the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft distributions of channels between networks. In particular, the KL divergence enables learning to pay more attention to the most salient regions of the channel-wise maps, presumably corresponding to the most useful signals for semantic segmentation. Experiments demonstrate that our channel-wise distillation outperforms almost all existing spatial distillation methods for semantic segmentation considerably, and requires less computational cost during training. We consistently achieve superior performance on three benchmarks with various network structures. Code is available at: https://git.io/Distiller

Citations (226)

Summary

  • The paper introduces channel-wise knowledge distillation by converting each channel's activations into a soft probability distribution and aligning the corresponding teacher and student channels with KL divergence.
  • It refines the distillation process with channel normalization, emphasizing key features for accurate dense prediction tasks.
  • Experimental results demonstrate significant improvements in mAP and mIoU, making the method efficient for resource-constrained applications.

Overview of Channel-wise Knowledge Distillation for Dense Prediction

This paper addresses knowledge distillation (KD) for dense prediction tasks, including semantic segmentation and object detection. The authors propose a method termed Channel-wise Knowledge Distillation, which optimizes the distillation process by leveraging channel-level activations. Unlike prior approaches, which primarily emphasize spatial alignment of activation maps, this method normalizes each channel's activation map into a soft probability distribution and then minimizes the Kullback-Leibler (KL) divergence between the corresponding channel distributions of the teacher and student networks, emphasizing the regions most relevant to dense prediction.
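
Concretely, the distillation term described above amounts to a per-channel softmax followed by a channel-wise KL divergence. The formulation below is a sketch consistent with that description, using a temperature parameter as is conventional in KD; the paper's exact notation and scaling may differ slightly:

```latex
\phi\!\left(y_{c,i}\right) \;=\; \frac{\exp\left(y_{c,i}/\mathcal{T}\right)}{\sum_{j=1}^{W\cdot H}\exp\left(y_{c,j}/\mathcal{T}\right)},
\qquad
\mathcal{L}_{\mathrm{cd}} \;=\; \frac{\mathcal{T}^{2}}{C}\sum_{c=1}^{C}\sum_{i=1}^{W\cdot H}
\phi\!\left(y^{\mathrm{T}}_{c,i}\right)\,\log\frac{\phi\!\left(y^{\mathrm{T}}_{c,i}\right)}{\phi\!\left(y^{\mathrm{S}}_{c,i}\right)}
```

Here y^T and y^S are the teacher's and student's activation maps, c indexes the C channels, i indexes the W·H spatial locations, and T is the temperature.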

Methodological Contributions

The proposed channel-wise KD method introduces a refined focus on channel normalization and KL divergence, presenting several methodological advancements over existing KD approaches:

  1. Channel Normalization: By transforming each channel’s activations into a probability distribution, the method ensures attention is placed on the salient features most relevant to dense prediction tasks. This contrasts with spatial methods, which often treat every feature location as equally important.
  2. KL Divergence: Using KL divergence as the alignment metric emphasizes salient regions by assigning greater weight to pixel locations with higher activation in the teacher’s network. This selective focus is integral to learning the precise localization required in dense prediction (a code sketch of the resulting loss follows this list).
  3. Empirical Performance: The authors conduct extensive experiments on benchmark datasets such as COCO, Cityscapes, ADE20K, and Pascal VOC. Their results demonstrate significant gains in mAP and mIoU metrics, with notable improvements of up to 3.4% in mAP for object detection on the COCO dataset and 5.81% in mIoU for semantic segmentation on the Cityscapes dataset using the PSPNet with a ResNet18 backbone.
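
The normalization and KL terms above combine into a single distillation loss. The snippet below is a minimal PyTorch sketch of that loss, implementing the equation given earlier; the function name, default temperature, mean reduction, and the assumption that the student's feature map has already been projected to the teacher's channel count are illustrative choices, not the authors' released implementation (linked in the abstract).

```python
import torch
import torch.nn.functional as F

def channel_wise_kd_loss(feat_s: torch.Tensor,
                         feat_t: torch.Tensor,
                         temperature: float = 4.0) -> torch.Tensor:
    """Channel-wise distillation loss (sketch).

    feat_s, feat_t: student / teacher feature maps of shape (N, C, H, W),
    assumed to have matching channel counts.
    """
    n, c, _, _ = feat_t.shape
    # Flatten spatial dimensions so each channel becomes a vector of H*W scores.
    s = feat_s.reshape(n, c, -1)
    t = feat_t.reshape(n, c, -1)
    # Softmax over spatial locations turns each channel into a probability map.
    log_p_s = F.log_softmax(s / temperature, dim=-1)
    p_t = F.softmax(t / temperature, dim=-1)
    # KL divergence between corresponding teacher and student channels,
    # scaled by T^2 (standard KD convention), averaged over batch and channels.
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)  # shape (N, C)
    return (temperature ** 2) * kl.mean()

# Example: distill a 19-channel segmentation logit map (e.g. Cityscapes classes).
student_logits = torch.randn(2, 19, 64, 64, requires_grad=True)
teacher_logits = torch.randn(2, 19, 64, 64)
loss = channel_wise_kd_loss(student_logits, teacher_logits)
loss.backward()
```

In practice this term would be added to the usual task loss (e.g. per-pixel cross-entropy for segmentation) with a weighting coefficient.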

Comparative Analysis

The paper compares the proposed channel-wise method against state-of-the-art distillation techniques, including spatial knowledge distillation variants. The empirical analysis shows that the method not only improves prediction accuracy but also reduces computational overhead during training, providing a more efficient learning paradigm that is especially suited to mobile and other resource-constrained applications.

Implications and Future Directions

The proposed method strikes a useful balance between accuracy and efficiency in dense prediction models. It has practical implications for deploying high-performance models on resource-constrained devices, an increasingly pertinent requirement as AI applications proliferate across technological platforms.

Theoretically, the concepts introduced invite further exploration of other channel-focused methodologies in deep learning. Extending such channel-wise approaches to areas like depth estimation, or even beyond vision tasks, is a compelling research trajectory. Additionally, further investigation into tuning the channel normalization and the KL divergence term could yield even richer performance gains.

Conclusion

In conclusion, this paper presents a sophisticated yet straightforward approach to enhancing KD for dense prediction tasks. Through meticulous channel-wise probability mapping and KL divergence minimization, the research significantly contributes to the field of computer vision, promising substantial impacts on both theoretical understanding and practical implementations of compact models. As deep learning continues to evolve, such targeted, granular methodologies will likely form the backbone of future innovations.