Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis (2109.00412v2)

Published 1 Sep 2021 in cs.CL and cs.AI

Abstract: In multimodal sentiment analysis (MSA), the performance of a model highly depends on the quality of synthesized embeddings. These embeddings are generated from the upstream process called multimodal fusion, which aims to extract and combine the input unimodal raw data to produce a richer multimodal representation. Previous work either back-propagates the task loss or manipulates the geometric property of feature spaces to produce favorable fusion results, which neglects the preservation of critical task-related information that flows from input to the fusion results. In this work, we propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (inter-modality) and between multimodal fusion result and unimodal input in order to maintain task-related information through multimodal fusion. The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task. To address the intractable issue of MI bounds, we further formulate a set of computationally simple parametric and non-parametric methods to approximate their truth value. Experimental results on the two widely used datasets demonstrate the efficacy of our approach. The implementation of this work is publicly available at https://github.com/declare-lab/Multimodal-Infomax.

An Expert Overview of "Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis"

The paper "Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis" addresses the perennial challenge in multimodal sentiment analysis (MSA) of effectively fusing information from different modalities—text, visual, and acoustic. This fusion process, which creates enriched multimodal representations, is essential for accurate sentiment analysis yet often fails to retain critical task-related information when conducted using traditional approaches like loss back-propagation or feature space geometric manipulation.

Contributions and Methodology

The authors propose a novel framework, MultiModal InfoMax (MMIM), which aims to hierarchically maximize mutual information (MI) at two levels: between unimodal input pairs (inter-modality) and between multimodal fusion results and unimodal inputs. This hierarchical MI maximization is intended to ensure that task-related information is preserved through the fusion process, thereby enhancing the downstream performance of sentiment analysis tasks.
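
To make the objective concrete, the sketch below shows one way such a hierarchical criterion could be wired up in PyTorch: a task loss is combined with contrastive (InfoNCE-style) lower bounds on MI between unimodal pairs and between the fusion vector and a unimodal representation. This is a minimal sketch, not the authors' implementation; the `InfoNCE` critic, the choice to anchor both levels on the text modality, and the weights `alpha` and `beta` are illustrative assumptions, whereas the paper derives its own estimators for each level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCE(nn.Module):
    """Contrastive (InfoNCE-style) lower bound on I(x; y): matching
    (x_i, y_i) pairs in the batch are positives, all mismatched pairs
    are negatives."""
    def __init__(self, x_dim, y_dim):
        super().__init__()
        self.critic = nn.Linear(x_dim, y_dim, bias=False)  # bilinear score

    def forward(self, x, y):
        scores = self.critic(x) @ y.t()                  # (batch, batch)
        labels = torch.arange(x.size(0), device=x.device)
        # Maximizing the bound corresponds to minimizing this cross-entropy
        return -F.cross_entropy(scores, labels)


class HierarchicalMIObjective(nn.Module):
    """Toy MMIM-style objective: task loss minus weighted MI lower bounds."""
    def __init__(self, t_dim, v_dim, a_dim, fusion_dim, alpha=0.1, beta=0.1):
        super().__init__()
        self.mi_tv = InfoNCE(t_dim, v_dim)           # text   <-> visual   (inter-modality)
        self.mi_ta = InfoNCE(t_dim, a_dim)           # text   <-> acoustic (inter-modality)
        self.mi_fusion = InfoNCE(fusion_dim, t_dim)  # fusion <-> text     (fusion level)
        self.alpha, self.beta = alpha, beta

    def forward(self, task_loss, z_t, z_v, z_a, z_fusion):
        mi_inter = self.mi_tv(z_t, z_v) + self.mi_ta(z_t, z_a)
        mi_fuse = self.mi_fusion(z_fusion, z_t)
        # Minimize the task loss while maximizing both MI terms
        return task_loss - self.alpha * mi_inter - self.beta * mi_fuse


# Usage with random stand-in features
objective = HierarchicalMIObjective(t_dim=64, v_dim=32, a_dim=16, fusion_dim=128)
z_t, z_v, z_a = torch.randn(8, 64), torch.randn(8, 32), torch.randn(8, 16)
z_fusion, y = torch.randn(8, 128), torch.randn(8)
task_loss = F.l1_loss(torch.randn(8), y)             # e.g. MAE regression loss
loss = objective(task_loss, z_t, z_v, z_a, z_fusion)
loss.backward()
```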

The authors provide a detailed formulation that tackles the inherent difficulty of MI estimation in high-dimensional spaces. They introduce a hybrid of parametric and non-parametric estimators to approximate MI, combining neural network-based estimation with Gaussian Mixture Models (GMMs), an approach tailored to the otherwise intractable computation of exact MI bounds.
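
As a rough illustration of the non-parametric side, the differential entropy of an embedding batch can be estimated in closed form once a Gaussian form is assumed; a full mixture would combine several such components. The function below is a minimal sketch under a single-Gaussian assumption, and the function name and ridge constant are hypothetical rather than taken from the paper's code.

```python
import math
import torch

def gaussian_entropy_estimate(z: torch.Tensor) -> torch.Tensor:
    """Differential entropy of embeddings under a single-Gaussian assumption:
    H(X) = 0.5 * logdet(2 * pi * e * Sigma).
    A full GMM would mix several such components; one is enough to show
    how entropy terms inside an MI bound can be approximated."""
    n, d = z.shape
    z_centered = z - z.mean(dim=0, keepdim=True)
    # Sample covariance with a small ridge for numerical stability
    cov = z_centered.t() @ z_centered / (n - 1) + 1e-5 * torch.eye(d)
    return 0.5 * torch.logdet(2 * math.pi * math.e * cov)

# Example: entropy of a batch of (hypothetical) 32-dimensional text embeddings
z_text = torch.randn(64, 32)
print(gaussian_entropy_estimate(z_text).item())
```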

Strong Numerical Results

The MMIM framework is evaluated on two widely used benchmarks, CMU-MOSI and CMU-MOSEI. The paper reports performance that is superior or comparable to state-of-the-art models across several metrics, including mean absolute error (MAE), Pearson correlation, seven-class classification accuracy (Acc-7), and the binary metrics Acc-2 and F1 score. Notably, MMIM improves on all metrics on CMU-MOSI and excels in accuracy and F1 on CMU-MOSEI, supporting the efficacy of the hierarchical MI maximization approach.
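
For reference, these metrics are typically computed from a single regression output in the range [-3, 3]. The sketch below follows the common CMU-MOSI/MOSEI evaluation protocol; exact details (for instance, how zero labels are treated in the binary metrics) vary across papers, so treat it as an approximation rather than the authors' exact evaluation script.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """Common CMU-MOSI/MOSEI regression metrics from real-valued predictions."""
    mae = np.mean(np.abs(preds - labels))
    corr = pearsonr(preds, labels)[0]
    # Acc-7: round sentiment scores into the 7 integer classes [-3, ..., 3]
    acc7 = np.mean(np.clip(np.round(preds), -3, 3)
                   == np.clip(np.round(labels), -3, 3))
    # Acc-2 / F1: positive vs. negative, dropping neutral (zero) labels
    nonzero = labels != 0
    bin_pred, bin_true = preds[nonzero] > 0, labels[nonzero] > 0
    acc2 = np.mean(bin_pred == bin_true)
    f1 = f1_score(bin_true, bin_pred)
    return {"MAE": mae, "Corr": corr, "Acc-7": acc7, "Acc-2": acc2, "F1": f1}
```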

Implications and Future Directions

This research demonstrates the potential of integrating mutual information theories with multimodal fusion in the field of sentiment analysis. On a practical level, MMIM’s ability to handle raw, unaligned multimodal data without extensive pre-processing stands out as a substantial development, potentially reducing the overhead in real-world applications.

Theoretically, the work contributes to the broader narrative of deep learning by showcasing how MI maximization can reinforce neural network training, guiding the model to focus on task-relevant features while minimizing noise. This approach may inspire future endeavors that explore further integration of information-theoretic principles in neural architectures, not only in sentiment analysis but also in other domains of multimodal learning.

In conclusion, while existing methods in multimodal sentiment analysis offer various strategies for model training and representation learning, this paper highlights the promising utility of mutual information as a guiding criterion to enhance performance. Future research could explore extending this approach to more complex, real-time applications and investigate its adaptability to other multimodal learning tasks.

Authors (3)
  1. Wei Han (202 papers)
  2. Hui Chen (298 papers)
  3. Soujanya Poria (138 papers)
Citations (262)