- The paper presents a framework that transfers supervisory signals from labeled to unlabeled image modalities, enhancing mid-level feature learning.
- It utilizes CNNs to distill information from RGB to depth and optical flow images, significantly improving object and action detection metrics.
- Experimental results on the NYU Depth and JHMDB datasets show notable gains in mean Average Precision for both object and action detection.
Cross Modal Distillation for Supervision Transfer
This paper presents a framework for transferring learned representations between image modalities, leveraging supervision from a well-labeled modality to learn features for a modality with little or no labeled data. The proposed technique, termed "Cross Modal Distillation," serves as a pre-training procedure for the unlabeled modality and yields marked performance gains on tasks such as object detection and segmentation when applied to depth and optical flow images.
Methodology
The core contribution is to use mid-level representations learned on a large labeled dataset as the supervisory signal for a paired but unlabeled modality. Concretely, a convolutional neural network (CNN) trained on the labeled modality (e.g., RGB) acts as a teacher: its activations at a chosen mid-level layer, such as the last convolutional layer, are computed on paired images, and a student CNN on the unlabeled modality (e.g., depth) is trained to regress those activations. This transfer induces an improved feature hierarchy in the target modality.
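A minimal sketch of this training objective is given below, assuming a PyTorch setup with paired RGB/depth tensors and torchvision AlexNet backbones; names such as `teacher`, `student`, and `distillation_step` are illustrative and not from the paper, which used a Caffe-based pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Teacher: RGB network pretrained on a large labeled dataset, kept frozen.
teacher = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).features.eval()
for p in teacher.parameters():
    p.requires_grad = False

# Student: same architecture, trained from scratch on the unlabeled modality.
# Depth images are assumed to be encoded as 3-channel inputs (e.g., an HHA-style encoding).
student = models.alexnet(weights=None).features

optimizer = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.MSELoss()  # regress the teacher's mid-level (last conv) activations

def distillation_step(rgb_batch, depth_batch):
    """One supervision-transfer step on a batch of paired RGB/depth images."""
    with torch.no_grad():
        target = teacher(rgb_batch)   # mid-level features from the labeled modality
    pred = student(depth_batch)       # same layer on the paired, unlabeled modality
    loss = loss_fn(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After this pre-training, the student network can be fine-tuned on whatever task-specific labels exist for the new modality, in the same way an ImageNet-pretrained RGB network would be.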
Experimental Results
The paper provides experimental validation on two datasets: the NYU Depth dataset for depth images and the JHMDB dataset for optical flow images. For depth images, supervision transfer from RGB improved object detection over state-of-the-art baselines by several points of mean Average Precision (mAP): detection on depth images alone rose from 34.2% to 41.7%, and combining RGB and depth raised performance to 49.1%.
For optical flow images, where supervision is transferred from paired RGB video frames, action detection performance on JHMDB increased from a 31.7% baseline to 35.7%.
Implications and Future Work
The implications of this research are far-reaching in computer vision, particularly for modalities that lack large annotated datasets. The ability to transfer learned representations to new modalities enables advanced vision systems in domains where manual annotation is impractical or data is scarce. Because the transferred features live in the same space as the teacher's, the method also supports a form of "zero-shot" detection, in which object detectors trained on the labeled modality can be applied to the new modality without any modality-specific annotations.
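As a rough illustration of this zero-shot adaptation, the sketch below continues the earlier one: a detector head trained only on the teacher's RGB features is reused on the distilled depth backbone. The head, its dimensions, and `num_classes` are hypothetical stand-ins for the paper's region-based detector.

```python
num_classes = 19  # illustrative number of object categories

# Hypothetical classifier head trained only on the teacher's (RGB) mid-level features.
rgb_detector_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, num_classes),
)

def zero_shot_scores(depth_batch):
    """Score depth images with an RGB-trained head and no depth annotations."""
    with torch.no_grad():
        feats = student(depth_batch)     # depth features aligned to the RGB feature space
        return rgb_detector_head(feats)  # class scores reused across modalities
```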
Future research may explore other architectures and more complex modality pairings. Increasing the diversity and number of paired images used for supervision transfer could also further improve transfer quality, especially at higher network layers.
Conclusion
Cross Modal Distillation offers a compelling approach to cross-modal supervision transfer, delivering significant performance gains in otherwise unlabeled image modalities. By circumventing the need for large-scale annotation, this work paves the way for applying sophisticated image representations to diverse and emerging imaging modalities.