- The paper introduces structured distillation methods that leverage pair-wise and holistic approaches to capture spatial dependencies in dense prediction tasks.
- It adapts a Markov random field framework and adversarial training to align teacher and student models for improved semantic segmentation, depth estimation, and object detection.
- Experimental results demonstrate significant gains in mIoU and AP scores, enabling efficient deployment of compact models in resource-constrained environments.
Structured Knowledge Distillation for Dense Prediction
The paper "Structured Knowledge Distillation for Dense Prediction" by Yifan Liu et al. addresses the critical challenge of transferring structured information from large networks (teachers) to smaller, compact models (students) specifically for dense prediction tasks in computer vision. These tasks, which include semantic segmentation, depth estimation, and object detection, require detailed pixel-level predictions and are computationally intensive.
The authors critique the straightforward adaptation of classification-oriented knowledge distillation to dense prediction, which distills the teacher's output independently at each pixel (pixel-wise distillation). Such methods, they argue, are sub-optimal because they ignore the structured nature of these problems. To address this, the paper introduces two structured distillation mechanisms: pair-wise distillation and holistic distillation.
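As a point of reference, the pixel-wise baseline can be sketched as a per-pixel KL divergence between the teacher's and student's class distributions. The function names and array shapes below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pixelwise_distillation_loss(teacher_logits, student_logits):
    """Mean per-pixel KL divergence KL(teacher || student).
    Both inputs have shape (H, W, C): one class distribution per pixel."""
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    eps = 1e-8  # avoid log(0)
    kl = (t * (np.log(t + eps) - np.log(s + eps))).sum(axis=-1)
    return kl.mean()
```

Because each pixel contributes an independent term, this loss carries no information about relationships between pixels, which is exactly the gap the structured methods target.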
Pair-Wise Distillation
In pair-wise distillation, the authors draw inspiration from the Markov random field framework to capture spatial labeling consistency. This approach builds a static affinity graph that represents pair-wise similarities between spatial locations in the network's feature space; the student is trained so that its affinity graph matches the teacher's, preserving both short- and long-range structural information. The authors evaluate fully connected graphs as well as varying connection ranges and granularities, showing substantial improvements over pixel-wise methods.
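The pair-wise idea can be sketched as matching cosine-similarity affinity graphs computed from teacher and student feature maps. The helper names and the squared-error matching loss below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def affinity_graph(features):
    """Cosine-similarity affinity between all spatial locations.
    features: (H, W, C) feature map -> (H*W, H*W) similarity matrix."""
    f = features.reshape(-1, features.shape[-1])
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    return f @ f.T

def pairwise_distillation_loss(teacher_feat, student_feat):
    """Squared difference between teacher and student affinity graphs,
    averaged over all pairs of spatial locations."""
    a_t = affinity_graph(teacher_feat)
    a_s = affinity_graph(student_feat)
    return ((a_t - a_s) ** 2).mean()
```

The fully connected graph is quadratic in the number of spatial locations, which is one reason the paper also studies coarser granularities that group nearby locations into single nodes.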
Holistic Distillation
Holistic distillation leverages adversarial training to align higher-order consistencies between the outputs of the teacher and student networks. The authors employ a conditional generative adversarial network (GAN): a discriminator, conditioned on the input image, learns to judge the quality of network outputs, and the student is trained to produce outputs the discriminator cannot distinguish from the teacher's. The paper outlines how this method overcomes limitations of pixel-wise approaches by capturing holistic structural properties that per-pixel objectives miss.
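A minimal sketch of the adversarial setup follows, using a fixed stand-in linear "discriminator" in place of the trained conditional embedding network the paper uses. All names, shapes, and the score-based loss form are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "discriminator": a fixed random linear scorer over pooled
# (image, segmentation-map) features. A real implementation would be a
# trained convolutional network updated adversarially against the student.
W = rng.normal(size=32) * 0.1  # 16 image dims + 16 segmentation dims

def embed(image_feat, seg_map):
    """Condition on the image by concatenating pooled image features
    with pooled segmentation-map features. Shapes: (H, W, 16) each."""
    return np.concatenate([image_feat.mean(axis=(0, 1)),
                           seg_map.mean(axis=(0, 1))])

def discriminator_score(image_feat, seg_map):
    """Scalar 'realism' score for a segmentation map given its image."""
    return float(embed(image_feat, seg_map) @ W)

def holistic_distillation_loss(image_feat, student_seg):
    """Adversarial term for the student (the generator): raise the
    discriminator's score of its own output, i.e. minimize its negation."""
    return -discriminator_score(image_feat, student_seg)
```

In training, this term would be minimized jointly with the task loss while the discriminator is updated in alternation to score teacher outputs above student outputs.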
Experimental Results
The paper substantiates its claims through comprehensive experiments across several state-of-the-art architectures and datasets like Cityscapes, CamVid, and ADE20K. Numerical results showcase considerable performance gains in mIoU (mean Intersection over Union) when employing the proposed distillation methods. For instance, using holistic distillation, MobileNetV2Plus achieved notable IoU improvements, particularly with complex, structured classes such as buses and trucks.
In depth estimation evaluated on NYUD-V2, structured distillation improves accuracy while significantly reducing relative error. On the detection side, the structured methods raise AP scores across student networks of varying architectural complexity.
Implications and Future Work
This work has profound implications for deploying neural networks in resource-constrained environments, such as mobile devices, by enabling smaller models to retain the performance of larger ones without increasing inference costs. The methods also promise enhancements for any task defined by dense prediction, as they exploit structural dependencies in data more effectively than traditional pixel-wise strategies.
Future work could apply these concepts to an even broader range of tasks or refine the granularity and scope of the affinity graphs used in pair-wise distillation. Extending structured distillation to unsupervised or semi-supervised settings could further broaden its applicability.
In conclusion, this paper provides significant advancements in knowledge distillation for dense prediction through innovative structured approaches. The results are not only empirically validated but also practically consequential, paving the way for more efficient and effective deployment of compact models in various applications.