
Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation (2108.01682v2)

Published 3 Aug 2021 in cs.CL and cs.CV

Abstract: Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and affects real world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images in input space using an object-aware transformer followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. Our approach increases the amount of text available to the language model and distills the object-level information in complex images. We achieve state-of-the-art performance on two multimodal Twitter datasets without modifying the internals of the language model to accept multimodal data, demonstrating the effectiveness of our translation. In addition, we explain a failure mode of a popular approach for aspect sentiment analysis when applied to tweets. Our code is available at https://github.com/codezakh/exploiting-BERT-thru-translation.

Authors (2)
  1. Zaid Khan (16 papers)
  2. Yun Fu (131 papers)
Citations (116)

Summary

  • The paper introduces input space translation using a captioning transformer to integrate visual cues into BERT without altering its architecture.
  • It achieves state-of-the-art performance on Twitter-15 and Twitter-17, with accuracies up to 78.01% and macro F1 scores reaching 73.25%.
  • The approach simplifies multimodal fusion by converting images into natural language descriptions, paving the way for robust sentiment analysis in social media.

Exploiting BERT for Multimodal Target Sentiment Classification Through Input Space Translation

This paper introduces a novel approach to multimodal target/aspect sentiment classification that pairs the BERT language model with a two-stream model translating images into the text input space using an object-aware transformer. The translation augments the textual representation and provides multimodal information to the language model through an auxiliary sentence mechanism. The research addresses the challenge of analyzing sentiment directed toward specific targets in social media content, which is inherently multimodal and often accompanied by complex or irrelevant images.
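As a rough illustration of the translation step, the sketch below uses the HuggingFace transformers library with an off-the-shelf autoregressive captioning checkpoint (nlpconnect/vit-gpt2-image-captioning) as a stand-in for the paper's object-aware, non-autoregressive captioning transformer; the checkpoint name and file path are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: map a tweet image into the text input space as a caption.
# Assumes the HuggingFace transformers library and an off-the-shelf captioner,
# standing in for the paper's object-aware, non-autoregressive transformer.
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

CHECKPOINT = "nlpconnect/vit-gpt2-image-captioning"  # illustrative choice
model = VisionEncoderDecoderModel.from_pretrained(CHECKPOINT)
processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def translate_image(path: str) -> str:
    """Translate an image into a short natural-language description."""
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

caption = translate_image("tweet_image.jpg")  # hypothetical file path
```

The resulting caption is plain text, so it can be handed to any text-only language model downstream without architectural changes.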

The authors propose a captioning transformer that translates the visual content of tweets into natural-language descriptions. This translation increases the amount of text available for sentiment analysis, and the generated caption is folded into an auxiliary sentence so that both textual and visual information can be fed to BERT as an ordinary sentence pair, enabling sentiment classification without modifying BERT to natively accept multimodal data.
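The auxiliary-sentence mechanism can be sketched as a standard BERT sentence-pair input, with the caption and target placed in the second segment. The template, example strings, and label ordering below are assumptions for illustration; the paper's exact construction and fine-tuning setup may differ, and in practice the classifier is fine-tuned on the Twitter datasets.

```python
# Minimal sketch of auxiliary-sentence construction and sentence-pair
# classification with BERT. Template and label order are assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

tweet = "just landed in Cleveland, @KingJames never disappoints"   # example input
target = "KingJames"                                               # target entity
caption = "a basketball player signing autographs for fans"        # from the translation step

# Hypothetical auxiliary-sentence template combining the caption with a
# target-focused prompt; the paper's template may differ.
auxiliary = f"{caption} . what do you think of {target} ?"

# Standard BERT sentence-pair encoding: [CLS] tweet [SEP] auxiliary [SEP]
inputs = tokenizer(tweet, auxiliary, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
prediction = logits.argmax(dim=-1).item()  # assumed label order: 0=negative, 1=neutral, 2=positive
```

Because the multimodal signal arrives purely as extra text in the second segment, BERT's internals stay untouched; only the input construction changes.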

Numerical Results and Experimental Evaluation

The experimental evaluation shows that the proposed approach achieves state-of-the-art performance on two benchmark multimodal Twitter datasets, Twitter-15 and Twitter-17, with notable improvements over text-only models such as BERT and multimodal models such as TomBERT. On the Twitter-15 dataset, the proposed EF-CaTrBERT configuration improved accuracy to 78.01% with a macro F1 score of 73.25%. On the Twitter-17 dataset, EF-CaTrBERT-DE, a domain-adapted variant, achieved an accuracy of 72.3% with a macro F1 score of 70.2%.

Implications and Future Developments

The results indicate that input space translation is a compelling method for multimodal sentiment analysis in social media contexts where images and text interact. The technique not only leverages existing language models without architectural modification but also simplifies multimodal fusion by representing images as readable natural-language text. In doing so, it offers a path toward more interpretable and robust sentiment analysis in increasingly complex multimodal environments.

The approach sets a precedent for incorporating visual data into sentiment analysis without requiring changes to the language model, keeping training stable and reproducible across implementations. This flexibility may motivate further research into applying similar translation techniques in other domains that require multimodal fusion, such as opinion mining in video content or sentiment evaluation in real-time streaming services.

Conclusion

The paper contributes a meaningful advance in multimodal sentiment analysis, offering insights into handling short texts with accompanying visuals, a setting especially pertinent to social media. Future work might extend the translation and fusion technique to multimedia beyond static images, while also addressing the remaining challenges of noise and irrelevance in the visual modality.