Tensor Fusion Network for Multimodal Sentiment Analysis (1707.07250v1)

Published 23 Jul 2017 in cs.CL

Abstract: Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.

Authors (5)
  1. Amir Zadeh (36 papers)
  2. Minghai Chen (3 papers)
  3. Soujanya Poria (138 papers)
  4. Erik Cambria (136 papers)
  5. Louis-Philippe Morency (123 papers)
Citations (1,098)

Summary

The paper "Tensor Fusion Network for Multimodal Sentiment Analysis" presents a novel approach for sentiment analysis in videos by explicitly modeling both intra-modality and inter-modality dynamics. The authors propose the Tensor Fusion Network (TFN), which represents a significant advancement over traditional early fusion or late fusion techniques in multimodal sentiment analysis.

Problem Statement

Multimodal sentiment analysis extends traditional text-based sentiment analysis to incorporate multiple modalities typically found in opinionated videos: language (spoken words), visual (gestures), and acoustic (voice). This new setup poses two central challenges:

  1. Inter-Modality Dynamics: These are interactions between modalities that can alter how sentiment is perceived. For instance, the utterance "This movie is sick" can convey opposite sentiments depending on the accompanying facial expression or intonation.
  2. Intra-Modality Dynamics: These are the dynamics within a single modality. Spoken language, for instance, is often far less structured than written text, presenting unique challenges of its own.

Proposed Approach

The Tensor Fusion Network (TFN) is designed to address both intra-modality and inter-modality dynamics in an end-to-end manner. TFN consists of three key components:

  1. Modality Embedding Subnetworks: These subnetworks independently process each modality (language, visual, acoustic) to produce rich embeddings. For language, a Long Short-Term Memory (LSTM) network handles the sequential nature of spoken language, while the visual and acoustic modalities are processed by deep neural networks that map their respective features into fixed-size embeddings.
  2. Tensor Fusion Layer: This is the paper's central innovation. It models the interplay between modalities by constructing a tensor from the three-fold Cartesian product of the modality embeddings, each extended with a constant 1. The resulting tensor explicitly contains unimodal, bimodal, and trimodal interaction terms, allowing the network to learn intricate inter-modality dynamics (see the sketch after this list).
  3. Sentiment Inference Subnetwork: Based on the tensor generated by the Tensor Fusion Layer, this subnetwork performs the sentiment analysis tasks, which include binary sentiment classification, five-class sentiment classification, and sentiment regression.
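The core operation of the Tensor Fusion Layer is a batched three-way outer product over the embeddings produced by the modality subnetworks. Below is a minimal PyTorch sketch of that operation; the `TensorFusion` module name and the embedding dimensions are illustrative assumptions, not taken from the authors' code.

```python
# Minimal sketch of the Tensor Fusion Layer (PyTorch). Only the appended-1
# outer-product fusion follows the paper; everything else is illustrative.
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Fuses language, visual, and acoustic embeddings via their outer product.

    Appending a constant 1 to each embedding before taking the outer product
    makes the fused tensor contain the original unimodal embeddings, all
    pairwise (bimodal) products, and the trimodal product as sub-tensors.
    """

    def forward(self, z_l, z_v, z_a):
        one = z_l.new_ones(z_l.size(0), 1)   # column of ones, shape (B, 1)
        z_l = torch.cat([z_l, one], dim=1)   # (B, d_l + 1)
        z_v = torch.cat([z_v, one], dim=1)   # (B, d_v + 1)
        z_a = torch.cat([z_a, one], dim=1)   # (B, d_a + 1)
        # Batched three-way outer product -> (B, d_l+1, d_v+1, d_a+1),
        # flattened for the downstream sentiment inference subnetwork.
        fused = torch.einsum("bi,bj,bk->bijk", z_l, z_v, z_a)
        return fused.flatten(start_dim=1)

# Example: 128-d language and 32-d visual/acoustic embeddings (illustrative
# sizes) yield a 129 * 33 * 33 = 140,481-dimensional fused representation.
fusion = TensorFusion()
out = fusion(torch.randn(4, 128), torch.randn(4, 32), torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 140481])
```

The appended 1s are what make the fused tensor a strict superset of simple feature concatenation: the unimodal sub-tensors recover the concatenated embeddings (i.e., early fusion), while the higher-order sub-tensors add the bimodal and trimodal interaction terms that early fusion cannot represent.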

Experimental Results

The authors extensively evaluate TFN on the CMU-MOSI dataset, demonstrating superior performance compared to several state-of-the-art methods:

  • Binary Sentiment Classification: TFN achieves 77.1% accuracy, outperforming previous models such as C-MKL and SAL-CNN by around 4%.
  • Five-Class Sentiment Classification: TFN reaches 42.0% accuracy, a marked improvement over the compared baselines.
  • Sentiment Regression: TFN achieves a mean absolute error (MAE) of 0.87 and a correlation coefficient (r) of 0.70, reflecting its robustness in regression tasks.

The results indicate that TFN's explicit modeling of both unimodal and multimodal dynamics yields more accurate sentiment predictions.

Implications and Future Work

From a practical standpoint, TFN has the potential to improve sentiment analysis applications, especially in social media monitoring and content recommendation systems, where video content is increasingly prevalent. The ability to accurately gauge sentiment from multimodal inputs can better tailor user experiences and automate content evaluation.

Theoretically, the integration of tensor-based multimodal fusion sets a precedent for future work in this domain. The explicit modeling of all interaction types within the Tensor Fusion Layer opens avenues for exploring other complex multimodal tasks beyond sentiment analysis.

In the future, developments could focus on:

  • Scaling: Adapting TFN to handle even larger datasets and more diverse modalities, such as physiological sensors.
  • Real-time Processing: Enhancing the efficiency of TFN to allow real-time sentiment analysis, which is crucial for live streaming and real-time user feedback systems.
  • Generalization: Applying the TFN framework to different languages and cultural contexts to test its universality and adaptability.

Overall, the Tensor Fusion Network marks a notable advancement in the field of multimodal sentiment analysis, offering an effective solution to the complex challenge of integrating multiple sources of information to understand sentiment.