
Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion (1911.07848v4)

Published 18 Nov 2019 in cs.CV, cs.LG, and cs.MM

Abstract: Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction loss and classification loss. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative. Code is available at: https://github.com/TmacMai/ARGF_multimodal_fusion

Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

The paper "Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion" introduces an innovative framework for addressing challenges in multimodal fusion. This framework, termed Adversarial Representation Graph Fusion (ARGF), aims at bridging the modality gap through adversarial training and employs a hierarchical graph fusion network for effective integration of cross-modal data.

Core Contributions

The paper introduces an adversarial encoder-decoder-classifier architecture that learns a modality-invariant embedding space by transforming diverse modality distributions into a common representation. The encoders act as generators that map source-modality representations into the distribution of the target modality, while a discriminator tries to distinguish translated embeddings from genuine target-modality embeddings. Decoder networks preserve unimodal information through reconstruction, and a classifier keeps the embedding space discriminative with respect to the learning task.
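This design can be summarized as three losses applied to each translated modality: an adversarial loss, a reconstruction loss, and a classification loss. The following PyTorch sketch is illustrative only; the `mlp` helper, layer sizes, and loss weights are assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the adversarial encoder-decoder-classifier idea.
# Module names, dimensions, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_out))

d_src, d_tgt, d_emb, n_cls = 300, 74, 64, 2   # e.g. text (source), audio (target)

enc_src = mlp(d_src, d_emb)    # encoder/generator: source modality -> joint space
enc_tgt = mlp(d_tgt, d_emb)    # encoder for the target modality
dec_src = mlp(d_emb, d_src)    # decoder: retains unimodal information
disc    = mlp(d_emb, 1)        # discriminator: target vs. translated embeddings
clf     = mlp(d_emb, n_cls)    # classifier: keeps the embedding discriminative

bce = nn.BCEWithLogitsLoss()
x_src, x_tgt = torch.randn(8, d_src), torch.randn(8, d_tgt)
y = torch.randint(0, n_cls, (8,))

z_src, z_tgt = enc_src(x_src), enc_tgt(x_tgt)

# 1) Discriminator step: tell target embeddings apart from translated ones.
d_loss = bce(disc(z_tgt.detach()), torch.ones(8, 1)) + \
         bce(disc(z_src.detach()), torch.zeros(8, 1))

# 2) Encoder step: fool the discriminator (adversarial/translation loss) ...
g_adv = bce(disc(z_src), torch.ones(8, 1))
# ... while reconstructing the source input (reconstruction loss) ...
g_rec = nn.functional.mse_loss(dec_src(z_src), x_src)
# ... and staying predictive of the task label (classification loss).
g_cls = nn.functional.cross_entropy(clf(z_src), y)
g_loss = g_adv + 0.5 * g_rec + 1.0 * g_cls   # weights are placeholders
```

In an actual training loop, `d_loss` would update only the discriminator and `g_loss` only the encoders, decoders, and classifier, alternating between the two objectives.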

The hierarchical graph fusion network (GFN) achieves multimodal integration through layers dedicated to unimodal, bimodal, and trimodal interaction modeling. The GFN explicitly estimates importance weights for individual interactions, which makes the fusion process more interpretable and flexible and lets the network account for the varying contributions of different modalities.
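The staged structure can be illustrated with a small PyTorch module that builds bimodal vertices from pairs of unimodal embeddings, trimodal vertices from pairs of bimodal ones, and pools each stage with learned importance weights. The layer sizes and the softmax weighting scheme below are assumptions for the sketch, not the paper's exact GFN.

```python
# Sketch of hierarchical (unimodal -> bimodal -> trimodal) graph fusion
# with learned importance weights; details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from itertools import combinations

class GraphFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.pair_fuse = nn.Linear(2 * d, d)   # bimodal vertex from two unimodal vertices
        self.tri_fuse  = nn.Linear(2 * d, d)   # trimodal vertex from two bimodal vertices
        self.score     = nn.Linear(d, 1)       # importance weight per interaction

    def weighted_sum(self, vertices):
        v = torch.stack(vertices, dim=1)       # (batch, n_vertices, d)
        w = F.softmax(self.score(v), dim=1)    # normalized importance weights
        return (w * v).sum(dim=1)

    def forward(self, uni):                    # uni: list of 3 (batch, d) embeddings
        bi  = [torch.relu(self.pair_fuse(torch.cat([a, b], dim=-1)))
               for a, b in combinations(uni, 2)]   # unimodal -> bimodal layer
        tri = [torch.relu(self.tri_fuse(torch.cat([a, b], dim=-1)))
               for a, b in combinations(bi, 2)]    # bimodal -> trimodal layer
        # Pool each stage with its importance weights and concatenate the results.
        return torch.cat([self.weighted_sum(uni),
                          self.weighted_sum(bi),
                          self.weighted_sum(tri)], dim=-1)

fused = GraphFusion(d=64)([torch.randn(8, 64) for _ in range(3)])  # shape (8, 192)
```

The learned weights expose how much each unimodal, bimodal, or trimodal interaction contributes to the fused representation, which is what gives the fusion stage its interpretability.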

Experimental Analysis

ARGF was evaluated on three multimodal benchmarks: CMU-MOSI, CMU-MOSEI, and IEMOCAP. It achieved state-of-the-art accuracy, outperforming existing methods such as LMFN and tensor-fusion-based approaches. These results support the value of matching modality distributions before fusion and the robustness of hierarchical graph-based integration.

Implications and Future Directions

The proposed framework advances multimodal fusion by reducing the statistical heterogeneity between modalities and by exploring cross-modal interactions through graph-based processing. These advances bear on many applications, including sentiment analysis, emotion recognition, and multimodal language understanding. Future research may investigate how ARGF scales to larger datasets and whether it transfers to other domains in AI.

The framework's design suggests extensions to other settings where multimodal data are ubiquitous, which could further improve the coverage and accuracy of affective computing systems. More broadly, the paper reinforces the importance of learning joint embedding spaces as a foundation for integrating multimodal information efficiently.

Authors (3)
  1. Sijie Mai (14 papers)
  2. Haifeng Hu (27 papers)
  3. Songlong Xing (4 papers)
Citations (165)