Deep Multimodal Fusion by Channel Exchanging: A Comprehensive Overview
The paper "Deep Multimodal Fusion by Channel Exchanging" presents a novel framework, the Channel-Exchanging-Network (CEN), designed to improve multimodal fusion methodologies in machine learning. Multimodal fusion techniques have traditionally been categorized into aggregation-based and alignment-based methods, but they face challenges in effectively balancing inter-modal and intra-modal information processing. CEN addresses these challenges by introducing a self-regulating, parameter-free method for dynamically exchanging channels between the sub-networks of different modalities.
Framework and Methodology
CEN differentiates itself from traditional methods by forgoing the conventional aggregation and alignment strategies. Instead, it adopts a channel exchanging scheme that leverages Batch Normalization (BN) scaling factors as indicators of channel importance: an ℓ1 sparsity penalty on these factors drives uninformative channels toward zero during training, and channels whose scaling factors fall below a small threshold are replaced with the averaged corresponding channels from the other modalities. The exchange itself introduces no extra parameters and is guided by the dynamics of training, making it adaptive and efficient.
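A minimal PyTorch sketch of this exchanging step is given below, assuming post-BN feature maps and the BatchNorm2d layers that produced them. The function name and threshold value are ours, and the paper's restriction of exchanging to disjoint channel sub-parts per modality is omitted for brevity.

```python
import torch
import torch.nn as nn

def exchange_channels(feats, bns, threshold=1e-2):
    """Replace channels whose BN scaling factor |gamma| is below `threshold`
    with the average of the corresponding channels of the other modalities.

    feats: list of post-BN feature maps [N, C, H, W], one per modality.
    bns:   list of the nn.BatchNorm2d layers that produced them.
    """
    total = torch.stack(feats).sum(dim=0)           # sum over all modalities
    out = []
    for x, bn in zip(feats, bns):
        others = (total - x) / (len(feats) - 1)     # mean of the *other* modalities
        mask = (bn.weight.abs() < threshold).view(1, -1, 1, 1)
        out.append(torch.where(mask, others, x))    # exchange where gamma ~ 0
    return out
```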
A distinctive feature of CEN is the sharing of convolutional filters across modalities while maintaining separate BN layers per modality. This design keeps the fused model close to the size of a single-modal network, improving computational efficiency without sacrificing performance. Keeping BN private also lets each modality retain its own channel statistics and scaling factors, so the shared filters capture modality-common patterns while the BN layers model modality-specific ones.
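This sharing scheme can be sketched as a small module (the class name is ours): one convolution applied to every modality's features, with a private BatchNorm2d per modality.

```python
import torch.nn as nn

class SharedConvPrivateBN(nn.Module):
    """One convolution shared by all modalities, with a separate
    BatchNorm2d per modality (CEN-style parameter sharing)."""
    def __init__(self, in_ch, out_ch, num_modalities):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(num_modalities))

    def forward(self, feats):
        # feats: list of [N, C, H, W] tensors, one per modality
        return [bn(self.conv(x)) for x, bn in zip(feats, self.bns)]
```

Stacking such blocks, with the exchanging step above applied between them, yields a multimodal encoder whose parameter count matches its single-modal counterpart up to the extra BN layers.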
Empirical Evaluation
The experimental results are notable, showing CEN outperforming existing fusion methods on two tasks: semantic segmentation and image-to-image translation.
- Semantic Segmentation: On the NYUDv2 and SUN RGB-D datasets, CEN outperformed state-of-the-art methods such as RefineNet and RDFNet by consistent margins in mean IoU, pixel accuracy, and mean accuracy. Qualitatively, CEN preserved fine segmentation detail and remained robust under challenging lighting, where the depth modality compensates for weak RGB cues.
- Image-to-Image Translation: On the Taskonomy dataset, CEN consistently outperformed common fusion baselines (concatenation, alignment, self-attention), achieving lower FID and KID scores across several translation tasks. The framework also handled more than two input modalities, indicating promising scalability beyond pairwise fusion.
Implications and Future Directions
CEN carries substantial implications for both theoretical exploration and practical deployment in AI applications. Its parameter-free fusion reduces complexity in model design and engineering overhead, a significant advantage for deploying machine learning models at scale. Its core operation, channel exchanging grounded in learned sparsity, invites further inquiry into adaptive, sparsity-driven network design principles.
Future research could explore extending CEN to heterogeneous modalities, such as the fusion of text and image data, where modality characteristics are inherently diverse. Additionally, integrating more sophisticated attention mechanisms or alternative forms of sparsity regularization could further refine the channel exchanging process.
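For reference, the baseline that such alternatives would replace is a network-slimming-style ℓ1 penalty on the BN scaling factors; a minimal sketch follows (the helper name and penalty weight are assumed here, not taken from the paper).

```python
def bn_sparsity_loss(bns, weight=1e-4):
    """l1 penalty on BN scaling factors (gamma): pushes uninformative
    channels toward zero so they become candidates for exchanging.
    `bns` is an iterable of nn.BatchNorm2d layers; `weight` is an
    assumed hyperparameter. Add the result to the task loss."""
    return weight * sum(bn.weight.abs().sum() for bn in bns)
```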
In conclusion, CEN represents a meaningful advancement in the multimodal fusion domain. It effectively addresses the dual challenges of inter-modal information exchange and intra-modal propagation, offering a promising avenue for future research and application in multimodal learning environments.