- The paper presents a novel channel-wise KD method that normalizes activation maps into probability distributions to highlight salient regions.
- It minimizes KL divergence between teacher and student channel distributions, yielding superior results compared to spatial techniques.
- Experimental results show improvements of 3.4% mAP on COCO and 5.81% mIoU on Cityscapes, demonstrating its efficacy in dense prediction tasks.
Channel-wise Knowledge Distillation for Dense Prediction
This paper introduces a novel channel-wise knowledge distillation (KD) approach tailored for dense prediction tasks. The core idea diverges from prevalent spatial KD methods by focusing on the information encoded within individual channels of activation maps. The method normalizes the activation map of each channel into a probability distribution and minimizes the Kullback-Leibler (KL) divergence between the channel-wise probability maps of the teacher and student networks. This strategy directs the student network's attention towards salient regions within each channel, which is beneficial for dense prediction tasks like semantic segmentation and object detection.
Methodology
The method contrasts with existing spatial distillation techniques, which typically align activation maps in the spatial domain by normalizing activation values at each spatial location. Instead, this work proposes normalizing activation maps channel-wise, converting them into probability distributions. The KL divergence is then minimized between the teacher's and student's channel distributions. The channel-wise distillation loss is defined as:
$\varphi\left(\phi(y^{T}), \phi(y^{S})\right) = \varphi\left(\phi(y^{T}_{c}), \phi(y^{S}_{c})\right),$
where ϕ(⋅) converts activation values into a probability distribution using a softmax function:
$\phi(y_{c}) = \frac{\exp\left(\frac{y_{c,i}}{\mathcal{T}}\right)}{\sum_{i=1}^{W\cdot H} \exp\left(\frac{y_{c,i}}{\mathcal{T}}\right)}.$
Here, y_c denotes the activation values of channel c, i indexes the W·H spatial locations, and T is a temperature hyperparameter: larger values of T yield softer distributions, spreading attention over more spatial locations. The discrepancy between teacher and student channel distributions is evaluated using KL divergence:
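A minimal PyTorch sketch of this channel-wise normalization, assuming activation maps of shape (N, C, H, W); the function name, shapes, and default temperature are illustrative choices, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def channel_softmax(y: torch.Tensor, temperature: float = 4.0) -> torch.Tensor:
    """Normalize each channel of an activation map into a spatial probability map.

    y: activation map of shape (N, C, H, W).
    Returns a tensor of shape (N, C, H*W) whose last dimension sums to 1,
    i.e. a distribution over the W*H spatial locations of each channel.
    """
    n, c, h, w = y.shape
    return F.softmax(y.reshape(n, c, h * w) / temperature, dim=-1)

# Example: every channel of every sample now sums to 1 over its spatial locations.
p = channel_softmax(torch.randn(2, 512, 64, 128))
assert torch.allclose(p.sum(-1), torch.ones(2, 512))
```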
$\varphi(y^{T}, y^{S}) = \frac{\mathcal{T}^{2}}{C} \sum_{c=1}^{C} \sum_{i=1}^{W\cdot H} \phi(y^{T}_{c,i}) \cdot \log\left[\frac{\phi(y^{T}_{c,i})}{\phi(y^{S}_{c,i})}\right].$
Because the KL divergence is asymmetric, the student is penalized heavily wherever the teacher's channel distribution places high probability (the salient, typically foreground regions), while locations the teacher assigns low probability contribute little to the loss; the student is therefore driven to mimic the teacher's foreground saliency rather than its background activations.
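Putting the two equations together, a self-contained sketch of the channel-wise distillation loss could look as follows. PyTorch is assumed; the default temperature, the batch averaging, and the assumption that the student map already matches the teacher's channel count are implementation choices rather than details specified above.

```python
import torch
import torch.nn.functional as F

def channel_wise_kd_loss(y_s: torch.Tensor,
                         y_t: torch.Tensor,
                         temperature: float = 4.0) -> torch.Tensor:
    """KL divergence between teacher and student channel-wise distributions.

    y_s, y_t: student and teacher activation maps of shape (N, C, H, W);
    the student map is assumed to already match the teacher's channel count
    (e.g. via a 1x1 projection) and spatial size.
    """
    n, c, h, w = y_t.shape
    y_s = y_s.reshape(n, c, h * w) / temperature
    y_t = y_t.reshape(n, c, h * w) / temperature
    p_t = F.softmax(y_t, dim=-1)  # teacher channel distributions over W*H locations
    # KL(teacher || student): asymmetric, so locations the teacher deems salient dominate.
    kl = (p_t * (F.log_softmax(y_t, dim=-1) - F.log_softmax(y_s, dim=-1))).sum(-1)  # (N, C)
    # Scale by T^2 / C as in the loss above, and average over the batch.
    return (temperature ** 2 / c) * kl.sum(-1).mean()
```

In practice such a loss would be computed on the final logits or an intermediate feature map of both networks and added to the student's task loss with a weighting coefficient.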
Experimental Results
The efficacy of the proposed method was evaluated on semantic segmentation (Cityscapes, ADE20K, Pascal VOC) and object detection (MS-COCO 2017). The results demonstrate that the channel-wise KD outperforms existing spatial KD methods. For instance, RetinaNet (ResNet50 backbone) showed a 3.4% improvement in mAP on the COCO dataset, and PSPNet (ResNet18 backbone) achieved a 5.81% increase in mIoU on the Cityscapes dataset. Ablation studies validated the importance of channel-wise normalization and asymmetric KL divergence. The method also showed consistent improvements across various network architectures and benchmarks.
Implications and Future Directions
The channel-wise KD paradigm offers a simple yet effective approach for training compact networks for dense prediction tasks. The results indicate that focusing on channel-wise information, rather than strict spatial alignment, can lead to better knowledge transfer. The consistent improvements across different tasks and network structures suggest the generalizability of the proposed method. Future research could explore adaptive temperature scaling for different channels or layers, as well as extending the method to other dense prediction tasks such as instance segmentation, depth estimation, and panoptic segmentation.
Conclusion
This paper presents a channel-wise KD method that significantly improves the performance of student networks in dense prediction tasks. By converting channel activations into probability distributions and minimizing the KL divergence, the method effectively transfers knowledge from teacher to student networks. The experimental results and ablation studies confirm the effectiveness and efficiency of the proposed approach, establishing it as a strong baseline for KD in dense prediction.