
Contrastive Learning-Based Spectral Knowledge Distillation for Multi-Modality and Missing Modality Scenarios in Semantic Segmentation (2312.02240v1)

Published 4 Dec 2023 in cs.CV and cs.AI

Abstract: Improving the performance of semantic segmentation models using multispectral information is crucial, especially for environments with low-light and adverse conditions. Multi-modal fusion techniques pursue either the learning of cross-modality features to generate a fused image or engage in knowledge distillation but address multimodal and missing modality scenarios as distinct issues, which is not an optimal approach for multi-sensor models. To address this, a novel multi-modal fusion approach called CSK-Net is proposed, which uses a contrastive learning-based spectral knowledge distillation technique along with an automatic mixed feature exchange mechanism for semantic segmentation in optical (EO) and infrared (IR) images. The distillation scheme extracts detailed textures from the optical images and distills them into the optical branch of CSK-Net. The model encoder consists of shared convolution weights with separate batch norm (BN) layers for both modalities, to capture the multi-spectral information from different modalities of the same objects. A Novel Gated Spectral Unit (GSU) and mixed feature exchange strategy are proposed to increase the correlation of modality-shared information and decrease the modality-specific information during the distillation process. Comprehensive experiments show that CSK-Net surpasses state-of-the-art models in multi-modal tasks and for missing modalities when exclusively utilizing IR data for inference across three public benchmarking datasets. For missing modality scenarios, the performance increase is achieved without additional computational costs compared to the baseline segmentation models.

Authors (3)
  1. Aniruddh Sikdar (11 papers)
  2. Jayant Teotia (1 paper)
  3. Suresh Sundaram (68 papers)
Citations (1)

Summary

  • The paper introduces CSK-Net, which uses contrastive learning for effective spectral knowledge distillation in multi-modal semantic segmentation.
  • It integrates optical and IR data with a gated spectral unit and mixed feature exchange strategy to capture distinctive modality-specific features.
  • CSK-Net outperforms state-of-the-art models on three benchmarks, significantly improving segmentation in missing modality scenarios without extra cost.

Semantic segmentation, the process of classifying every pixel in an image into a category, is an essential task in computer vision with real-world applications such as autonomous driving and robotics. One hurdle in semantic segmentation is the reliance on optical images, which often perform poorly under adverse conditions like rain or low light. To address this, the use of infrared (IR) cameras, which can penetrate obstructions like dust and smoke, has become increasingly popular. The combination of optical (EO) and IR images forms the basis of multi-spectral semantic segmentation, which aims to improve segmentation performance by fusing data from both modalities.

To advance this domain, the Contrastive Learning-Based Spectral Knowledge Distillation Network (CSK-Net) has been proposed. The model uses contrastive learning as its foundation, alongside an automatic mixed feature exchange mechanism. The network performs semantic segmentation on EO and IR images by distilling detailed textures and spectral knowledge from the optical images into the optical branch. Its encoder shares convolution weights across modalities while keeping separate batch normalization (BN) layers for each, so that multi-spectral information from both modalities of the same scene is captured effectively.
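The shared-weights/separate-BN idea can be illustrated with a minimal NumPy sketch. This is a hypothetical toy (a 1x1 convolution standing in for the real encoder), not the paper's implementation; the class and parameter names are invented for illustration:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize per channel over the batch and spatial dimensions.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

class SharedEncoderBlock:
    """Toy encoder block: one set of convolution weights shared across
    modalities, but a separate (gamma, beta) BN pair per modality."""
    def __init__(self, c_in, c_out, modalities=("eo", "ir"), seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((c_out, c_in)) * 0.1   # shared conv weights
        self.bn = {m: (np.ones((1, c_out, 1, 1)),           # per-modality gamma
                       np.zeros((1, c_out, 1, 1)))          # per-modality beta
                   for m in modalities}

    def forward(self, x, modality):
        # A 1x1 convolution is just channel mixing, expressed as an einsum.
        y = np.einsum("oc,nchw->nohw", self.w, x)
        gamma, beta = self.bn[modality]
        return batch_norm(y, gamma, beta)
```

The design point is that the convolution weights see both modalities during training, while the normalization statistics remain modality-specific, letting each branch keep its own activation distribution.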

Key to CSK-Net are two novel components: a Gated Spectral Unit (GSU) and a mixed feature exchange strategy. The GSU is designed to merge spectral information from different imaging modalities efficiently, crucial for enhancing the knowledge distillation process. The mixed feature exchange approach aims to increase the correlation of modality-shared information during distillation without imposing unnecessary constraints on the process, which could hinder the learning of high-frequency, modality-specific information. Significantly, such features are vital for the distinctiveness of each modality—textures for visible images and thermal radiation for IR images.
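The two mechanisms above can be sketched in simplified form. The gating and channel-swap functions below are hypothetical illustrations of the general ideas (a sigmoid gate blending the two modality streams, and a random per-channel exchange between branches), not the paper's GSU or exchange schedule; all parameter names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_spectral_fusion(f_eo, f_ir, w, b):
    """Blend EO and IR features with a learned per-channel gate.
    w: (C, 2C) mixing weights, b: (1, C, 1, 1) bias -- toy parameters."""
    stacked = np.concatenate([f_eo, f_ir], axis=1)              # (N, 2C, H, W)
    gate = sigmoid(np.einsum("oc,nchw->nohw", w, stacked) + b)  # (N, C, H, W)
    return gate * f_eo + (1.0 - gate) * f_ir

def mixed_feature_exchange(f_eo, f_ir, p=0.5, rng=None):
    """Randomly swap a fraction p of channels between the two branches,
    encouraging modality-shared representations without hard constraints."""
    rng = rng or np.random.default_rng(0)
    swap = rng.random(f_eo.shape[1]) < p        # boolean per-channel mask
    out_eo, out_ir = f_eo.copy(), f_ir.copy()
    out_eo[:, swap], out_ir[:, swap] = f_ir[:, swap], f_eo[:, swap]
    return out_eo, out_ir
```

A soft gate (rather than a fixed fusion rule) lets the network decide per channel and per location how much each modality should contribute, which is the intuition behind gated fusion units in general.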

The model has been tested extensively across three public benchmark datasets and has shown remarkable performance improvements compared to existing state-of-the-art models in tasks involving both multi-modality and scenarios where one modality is missing, specifically when exclusively using IR data for inference. Notably, the improvement achieved in missing modality scenarios comes without incurring additional computational costs compared to baseline segmentation models.
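The contrastive distillation objective driving these gains can be illustrated with a generic InfoNCE-style loss between teacher and student feature embeddings. This is a standard formulation sketched as a stand-in, assuming L2-normalized embeddings and in-batch negatives; it is not claimed to be CSK-Net's exact loss:

```python
import numpy as np

def info_nce_distillation(teacher, student, tau=0.1):
    """InfoNCE-style loss: each student embedding (row) is pulled toward
    the teacher embedding of the same sample (diagonal positives) and
    pushed from all other teacher embeddings (in-batch negatives)."""
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    logits = s @ t.T / tau                        # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

Under such an objective, the student branch is rewarded for matching the teacher's representation of the same input while staying distinguishable from other samples, which aligns shared information across modalities without forcing features to be identical.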

In summary, CSK-Net deals with multi-modal and missing modality challenges in a cohesive manner, rather than addressing them as separate issues. This conceptual sophistication leads to a system that not only performs better than its counterparts but does so by considering the full spectrum of available data, thereby setting a new standard for multi-spectral semantic segmentation tasks.
