
Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction (2112.02252v2)

Published 4 Dec 2021 in cs.CV

Abstract: Multimodal fusion and multitask learning are two vital topics in machine learning. Despite the fruitful progress, existing methods for both problems are still brittle to the same challenge -- it remains dilemmatic to integrate the common information across modalities (resp. tasks) meanwhile preserving the specific patterns of each modality (resp. task). Besides, while they are actually closely related to each other, multimodal fusion and multitask learning are rarely explored within the same methodological framework before. In this paper, we propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for multimodal and multitask dense image prediction. At its core, CEN adaptively exchanges channels between subnetworks of different modalities. Specifically, the channel exchanging process is self-guided by individual channel importance that is measured by the magnitude of Batch-Normalization (BN) scaling factor during training. For the application of dense image prediction, the validity of CEN is tested by four different scenarios: multimodal fusion, cycle multimodal fusion, multitask learning, and multimodal multitask learning. Extensive experiments on semantic segmentation via RGB-D data and image translation through multi-domain input verify the effectiveness of CEN compared to state-of-the-art methods. Detailed ablation studies have also been carried out, which demonstrate the advantage of each component we propose. Our code is available at https://github.com/yikaiw/CEN.

Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction

The paper "Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction" by Wang et al. introduces a novel approach to enhancing multimodal fusion and multitask learning in the field of dense image prediction tasks. These tasks encompass pixel-wise classification and regression challenges, such as semantic segmentation and image-to-image translation, where robust spatial detail and semantic extraction are imperative. This paper tackles the inherent challenge of achieving a balance between integrating shared information across modalities while preserving specific patterns for each modality or task.

The core proposition of this research is the Channel-Exchanging-Network (CEN), a framework that adaptively exchanges information between subnetworks of different modalities or tasks. The mechanism is guided by Batch-Normalization (BN) scaling factors: channels whose scaling factors are small, and therefore deemed less important, are replaced by the mean of the corresponding channels from the other subnetworks. The exchange itself introduces no additional parameters, making it parameter-free and self-adaptive.
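To make the exchange rule concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: the function name `exchange_channels`, the `threshold` value, and the assumption that each subnetwork exposes its stage-wise BatchNorm layers are all hypothetical, and the sketch covers only the exchange step itself, not the full training procedure.

```python
import torch


def exchange_channels(feats, bns, threshold=2e-2):
    """Sketch of BN-guided channel exchanging (hypothetical helper).

    feats: list of feature maps, each of shape [B, C, H, W] (one per subnetwork).
    bns:   list of the matching BatchNorm2d layers (one per subnetwork).
    Channels whose BN scaling factor magnitude falls below `threshold` are
    replaced by the mean of the corresponding channels from the other
    subnetworks; all remaining channels are left untouched.
    """
    exchanged = []
    for i, (feat, bn) in enumerate(zip(feats, bns)):
        # Mean of the corresponding channels over all *other* subnetworks.
        others = torch.stack([f for j, f in enumerate(feats) if j != i]).mean(dim=0)
        gamma = bn.weight.detach().abs()              # [C], BN scaling factors
        mask = (gamma < threshold).view(1, -1, 1, 1)  # low-importance channels
        exchanged.append(torch.where(mask, others, feat))
    return exchanged
```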

One of the significant claims of the paper is the broader applicability of CEN, extending beyond a single task to multimodal fusion, cycle multimodal fusion, multitask learning, and multimodal multitask learning. For example, in multimodal fusion, CEN performs channel exchanging across the encoders of different modalities to enhance the fusion of input from varying data types, such as RGB and depth in semantic segmentation tasks. This approach effectively reduces model complexity while maintaining robust predictive performance.
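As a rough illustration of how this could slot into a fusion pipeline, the sketch below reuses the hypothetical `exchange_channels` helper above and interleaves the exchange with the stages of two modality-specific encoders (e.g. RGB and depth); the stage and BN lists are assumed for illustration rather than taken from the released code.

```python
# Hypothetical two-stream encoder forward pass with channel exchanging applied
# after every stage; stages_* and bns_* are assumed lists of per-stage modules
# from two subnetworks that share the same channel layout.
def forward_two_stream(x_rgb, x_depth, stages_rgb, stages_depth, bns_rgb, bns_depth):
    f_rgb, f_depth = x_rgb, x_depth
    for stage_rgb, stage_depth, bn_rgb, bn_depth in zip(
            stages_rgb, stages_depth, bns_rgb, bns_depth):
        f_rgb, f_depth = stage_rgb(f_rgb), stage_depth(f_depth)
        # Exchange low-importance channels between the two modality streams.
        f_rgb, f_depth = exchange_channels([f_rgb, f_depth], [bn_rgb, bn_depth])
    return f_rgb, f_depth
```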

The authors empirically demonstrate the effectiveness of their method using extensive experiments across different settings. On tasks like semantic segmentation and image-to-image translation, CEN consistently outperforms traditional fusion methods like concatenation, alignment, and attention-based fusion. Notably, CEN achieves superior performance in scenarios with multimodal inputs, such as RGB-D data, indicating its efficacy in leveraging complementary information across modalities.

Numerical results highlight the improvement CEN brings over existing models. For instance, CEN improves on the state of the art in semantic segmentation on datasets such as NYUDv2 and SUN RGB-D. In addition, the channel exchanging process lets a single model handle various configurations of multimodal and multitask learning uniformly and efficiently, often reducing model complexity significantly without sacrificing performance.

The implications of this paper are multifaceted. Theoretically, it provides a new lens through which shared and modality-specific information can be dynamically assessed and balanced within neural networks. Practically, the findings encourage the deployment of more cohesive and computationally efficient models for applications that require integrating multimodal sensory data, as is common in autonomous systems and computer vision.

Looking forward, CEN can inspire future research in several directions. Its extension to more heterogeneous datasets, where modalities differ substantially (such as radar and visual data), could be explored. Integrating the method into unsupervised learning frameworks is another fertile ground for advancement, offering adaptive techniques for domains with scarce labeled data. As the field progresses, the principled exchange of information outlined in this paper may become a cornerstone capability for multifaceted neural network architectures.

Authors (5)
  1. Yikai Wang (78 papers)
  2. Fuchun Sun (127 papers)
  3. Wenbing Huang (95 papers)
  4. Fengxiang He (46 papers)
  5. Dacheng Tao (826 papers)
Citations (20)