
Channel-wise Knowledge Distillation for Dense Prediction

Published 26 Nov 2020 in cs.CV (arXiv:2011.13256v4)

Abstract: Knowledge distillation (KD) has been proven to be a simple and effective tool for training compact models. Almost all KD variants for dense prediction tasks align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancy. Observing that in semantic segmentation, some layers' feature activations of each channel tend to encode saliency of scene categories (analogous to class activation mapping), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a probability map using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence of the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft distributions of channels between networks. In particular, the KL divergence enables learning to pay more attention to the most salient regions of the channel-wise maps, presumably corresponding to the most useful signals for semantic segmentation. Experiments demonstrate that our channel-wise distillation outperforms almost all existing spatial distillation methods for semantic segmentation considerably, and requires less computational cost during training. We consistently achieve superior performance on three benchmarks with various network structures. Code is available at: https://git.io/Distiller

Citations (226)

Summary

  • The paper presents a novel channel-wise KD method that normalizes activation maps into probability distributions to highlight salient regions.
  • It minimizes KL divergence between teacher and student channel distributions, yielding superior results compared to spatial techniques.
  • Experimental results show improvements of 3.4% mAP on COCO and 5.81% mIoU on Cityscapes, demonstrating its efficacy in dense prediction tasks.


This paper introduces a novel channel-wise knowledge distillation (KD) approach tailored for dense prediction tasks. The core idea diverges from prevalent spatial KD methods by focusing on the information encoded within individual channels of activation maps. The method normalizes the activation map of each channel into a probability distribution and minimizes the Kullback-Leibler (KL) divergence between the channel-wise probability maps of the teacher and student networks. This strategy directs the student network's attention towards salient regions within each channel, which is beneficial for dense prediction tasks like semantic segmentation and object detection.

Methodology

The method contrasts with existing spatial distillation techniques, which typically align activation maps in the spatial domain by normalizing activation values at each spatial location. Instead, this work proposes normalizing activation maps channel-wise, converting them into probability distributions. The KL divergence is then minimized between the teacher's and student's channel distributions. The channel-wise distillation loss is defined as:

$\varphi\bigl(\phi(y^{T}), \phi(y^{S})\bigr) = \varphi\bigl(\phi(y^{T}_{c}), \phi(y^{S}_{c})\bigr),$

where $\phi(\cdot)$ converts activation values into a probability distribution using a softmax function:

$\phi\left(y_{c}\right) = \frac{\exp\left(\frac{y_{c,i}}{\mathcal{T}}\right)}{\sum_{i=1}^{W\cdot H} \exp\left(\frac{y_{c,i}}{\mathcal{T}}\right)}.$

Here, $y_{c}$ denotes the activation values of channel $c$, $i$ indexes the spatial locations, and $\mathcal{T}$ is a temperature hyperparameter. The discrepancy between the teacher's and student's channel distributions is evaluated using the KL divergence:

$\varphi\left(y^{T}, y^{S}\right) = \frac{\mathcal{T}^{2}}{C}\sum_{c=1}^{C}\sum_{i=1}^{W\cdot H} \phi\left(y^{T}_{c,i}\right) \cdot \log\left[\frac{\phi\left(y^{T}_{c,i}\right)}{\phi\left(y^{S}_{c,i}\right)}\right].$

The asymmetry of KL divergence ensures that the student network focuses on mimicking the teacher's foreground saliency, while background activations have less impact on learning.
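
For concreteness, below is a minimal PyTorch-style sketch of this channel-wise loss. The function name, the default temperature, and the assumption that the student's feature map already matches the teacher's shape (e.g. after a 1x1 projection) are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def channel_wise_distillation_loss(feat_t: torch.Tensor,
                                   feat_s: torch.Tensor,
                                   temperature: float = 4.0) -> torch.Tensor:
    """KL(teacher || student) over spatially softmax-normalized channels.

    feat_t, feat_s: teacher/student feature maps of shape (N, C, H, W),
    assumed to have matching shapes (hypothetical setup for illustration).
    """
    n, c, h, w = feat_t.shape
    # Flatten spatial dimensions so each channel is a vector over W*H locations.
    t = feat_t.reshape(n, c, -1)
    s = feat_s.reshape(n, c, -1)

    # Softmax over spatial locations turns each channel into a probability map.
    p_t = F.softmax(t / temperature, dim=2)
    log_p_t = F.log_softmax(t / temperature, dim=2)
    log_p_s = F.log_softmax(s / temperature, dim=2)

    # KL divergence per channel, summed over locations, averaged over channels
    # (and the batch), and scaled by T^2 as in the formulation above.
    kl = (p_t * (log_p_t - log_p_s)).sum(dim=2)  # shape (N, C)
    return (temperature ** 2) * kl.mean()
```

In practice such a distillation term is typically added to the task loss (e.g. the cross-entropy loss for segmentation) with a weighting coefficient.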

Experimental Results

The efficacy of the proposed method was evaluated on semantic segmentation (Cityscapes, ADE20K, Pascal VOC) and object detection (MS-COCO 2017). The results demonstrate that the channel-wise KD outperforms existing spatial KD methods. For instance, RetinaNet (ResNet50 backbone) showed a 3.4% improvement in mAP on the COCO dataset, and PSPNet (ResNet18 backbone) achieved a 5.81% increase in mIoU on the Cityscapes dataset. Ablation studies validated the importance of channel-wise normalization and asymmetric KL divergence. The method also showed consistent improvements across various network architectures and benchmarks.

Implications and Future Directions

The channel-wise KD paradigm offers a simple yet effective approach for training compact networks for dense prediction tasks. The results indicate that focusing on channel-wise information, rather than strict spatial alignment, can lead to better knowledge transfer. The consistent improvements across different tasks and network structures suggest the generalizability of the proposed method. Future research could explore adaptive temperature scaling for different channels or layers, as well as extending the method to other dense prediction tasks such as instance segmentation, depth estimation, and panoptic segmentation.

Conclusion

This paper presents a channel-wise KD method that significantly improves the performance of student networks in dense prediction tasks. By converting channel activations into probability distributions and minimizing the KL divergence, the method effectively transfers knowledge from teacher to student networks. The experimental results and ablation studies confirm the effectiveness and efficiency of the proposed approach, establishing it as a strong baseline for KD in dense prediction.
