WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge

Published 12 Jan 2024 in cs.CL | (2401.06659v2)

Abstract: Sentiment analysis is rapidly advancing by utilizing various data modalities (e.g., text, image). However, most previous works relied on superficial information, neglecting the incorporation of contextual world knowledge (e.g., background information derived from but beyond the given image and text pairs) and thereby restricting their ability to achieve better multimodal sentiment analysis (MSA). In this paper, we propose a plug-in framework named WisdoM to leverage the contextual world knowledge induced from large vision-language models (LVLMs) for enhanced MSA. WisdoM utilizes LVLMs to comprehensively analyze both images and corresponding texts, simultaneously generating pertinent context. To reduce the noise in the context, we also introduce a training-free contextual fusion mechanism. Experiments across diverse granularities of MSA tasks consistently demonstrate that our approach yields substantial improvements (an average gain of +1.96% F1 across five advanced methods) over several state-of-the-art methods.

Summary

The paper presents a novel framework that augments multimodal sentiment analysis by integrating contextual world knowledge with a training-free mechanism.
It leverages large vision-language models and prompt templates to generate and fuse relevant context, improving performance on benchmarks like Twitter2015 and Twitter2017.
Experimental results show an average F1 improvement of +1.96% across five advanced methods, highlighting the framework's robustness and adaptability across diverse sentiment analysis tasks.

Introduction

Multimodal sentiment analysis (MSA) has traditionally relied on surface-level information extracted from text and images. These methods often overlook contextual world knowledge, which can enrich sentiment analysis with background information that goes beyond the immediate content of the input. "WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge" proposes an approach that leverages large vision-language models (LVLMs) to incorporate such contextual knowledge and thereby improve performance on MSA tasks.

Figure 1: The simple schematic of our method. The sentiment polarity of Aleppo is negative, which existing methods fail to predict accurately without incorporating context.

Methodology

The WisdoM framework is a plug-in solution designed to enhance MSA by integrating contextual world knowledge through a series of structured stages:

1. Prompt Templates Generation: Leveraging LLMs like ChatGPT to generate prompts that guide LVLMs in generating relevant context.
2. Context Generation: Utilizing LVLMs to produce contextual information that complements the input data by providing background knowledge.
3. Contextual Fusion: Implementing a training-free mechanism that selectively incorporates context into the analysis, reducing noise and enhancing accuracy.

Figure 2: Detailed illustration of our proposed schema with a running example.
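
The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the template text, the callables `lvlm` and `classifier`, the top-two probability margin used to decide when context is consulted, and the linear interpolation with `weight` are all hypothetical choices made here to show what a training-free, selective fusion could look like.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical template. In WisdoM the templates are produced by an LLM;
# this string is only a stand-in for illustration.
TEMPLATE = "Give {kind} background knowledge for the image and the sentence: {text}"


def build_prompt(text: str, kind: str = "historical") -> str:
    """Stage 1 (assumed form): fill a prompt template for the LVLM."""
    return TEMPLATE.format(kind=kind, text=text)


def is_uncertain(probs: Sequence[float], margin: float) -> bool:
    """Assumed gate: the prediction is uncertain if the top-two gap is small."""
    top = sorted(probs, reverse=True)
    return (top[0] - top[1]) < margin


def fuse(p_input: Sequence[float], p_context: Sequence[float],
         weight: float) -> List[float]:
    """Assumed fusion: linear interpolation of two distributions, renormalized."""
    mixed = [(1.0 - weight) * a + weight * b for a, b in zip(p_input, p_context)]
    total = sum(mixed)
    return [m / total for m in mixed]


def predict_with_context(
    text: str,
    image: object,
    lvlm: Callable[[object, str], str],
    classifier: Callable[[str, object, Optional[str]], Sequence[float]],
    margin: float = 0.2,
    weight: float = 0.5,
) -> List[float]:
    """Run the base classifier; consult LVLM context only for uncertain inputs."""
    p_input = list(classifier(text, image, None))
    if not is_uncertain(p_input, margin):
        return p_input  # confident: skip context to avoid injecting noise
    context = lvlm(image, build_prompt(text))          # Stage 2
    p_context = classifier(text, image, context)
    return fuse(p_input, p_context, weight)            # Stage 3
```

No parameters are trained in this sketch: the gate and the interpolation only reuse probabilities that an existing sentiment model already produces, which is what lets such a step be attached to different base models as a plug-in.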

Experimental Analysis

The proposed method was evaluated on benchmarks including Twitter2015, Twitter2017, and MSED, yielding an average F1 improvement of +1.96% when applied to five advanced methods.

Figure 3: Comparative winning rates of Our Context vs. RAG-based methods on Twitter2015 and Twitter2017 benchmarks.

Results and Discussion

WisdoM shows significant enhancements in sentiment analysis performance, particularly in scenarios requiring deep context comprehension. The use of historical and cultural knowledge stands out as particularly beneficial when addressing sentiment-laden subjects embedded in broader narratives.

Figure 4: Effects of different types of world knowledge on the F1-score by applying WisdoM to AoM.

Performance Evaluation

Performance improvements were consistent across different model architectures, indicating the robustness and adaptability of WisdoM. The integration of context not only increased sentiment classification accuracy but also demonstrated effective disambiguation in challenging samples previously considered hard by state-of-the-art models.

Figure 5: Comparison of context ( $S_{c \rightarrow p}$ ) and input's ( $S_{i \rightarrow p}$ ) correlation to final predictions across layers.

Conclusion

WisdoM exemplifies the potential for enhanced MSA by utilizing contextual world knowledge, demonstrating notable improvements in both aspect-level and sentence-level sentiment analysis tasks. As AI continues to evolve, methodologies like WisdoM pave the way for more nuanced and contextually aware systems capable of deeper understanding and more accurate sentiment evaluation.

By incorporating nuanced context, WisdoM addresses limitations present in traditional MSA frameworks, providing an efficient, scalable, and adaptable solution that enhances sentiment analysis capabilities across diverse datasets.
