- The paper introduces ISSUES, a framework combining CLIP and textual inversion to create enriched multimodal representations for hateful meme classification.
- The method employs linear adaptations and a Combiner network to fuse text and imagery, achieving superior accuracy and AUROC compared to Hate-CLIPper.
- Experimental results confirm ISSUES delivers state-of-the-art performance on Hateful Memes Challenge datasets, highlighting enhanced model generalizability.
Mapping Memes to Words for Multimodal Hateful Meme Classification
Introduction
The task of classifying multimodal hateful memes involves the complex challenge of interpreting the combined meaning of text and images. This paper introduces ISSUES, a method that classifies hateful content more effectively by combining a pre-trained vision-language model with textual inversion. ISSUES builds on the CLIP vision-language model and incorporates the textual inversion technique to create a more robust representation for meme classification.
Proposed Approach
ISSUES leverages a pre-trained CLIP model, which aligns visual and textual data in a common embedding space. This approach is further enhanced with textual inversion, a technique that maps an image into a pseudo-word token in the CLIP embedding space, utilizing the SEARLE textual inversion network for efficient token generation.
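To make the mapping concrete, below is a minimal sketch of a SEARLE-style textual inversion network: a small MLP phi that maps a CLIP image feature to a single pseudo-word token embedding living in CLIP's token-embedding space. The layer widths, dropout rate, and class name are illustrative assumptions, not the exact SEARLE configuration.

```python
import torch
import torch.nn as nn

class TextualInversionPhi(nn.Module):
    """MLP mapping a CLIP image feature to one pseudo-word token embedding."""

    def __init__(self, clip_feature_dim: int = 768, token_embed_dim: int = 768,
                 hidden_dim: int = 3072, dropout: float = 0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_feature_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, token_embed_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, clip_feature_dim) from the CLIP image encoder
        # returns:        (batch, token_embed_dim) pseudo-word token embeddings
        return self.mlp(image_features)
```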
Figure 1: Examples of multimodal image-text memes. Given a meme, we want to classify whether its content conveys hate. The proposed ISSUES approach is more effective at evaluating the hatefulness of the memes than the state-of-the-art method Hate-CLIPper.
Textual Inversion Enhancement
By applying textual inversion, ISSUES enriches textual features to include multimodal information within the textual embedding domain. This is achieved by mapping images of memes into pseudo-word tokens that capture both visual and textual semantics, creating a more comprehensive representation.
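One plausible way to realize this enrichment, sketched below under stated assumptions, is to reserve a placeholder token in the caption and replace its embedding with the image-derived pseudo-word before running CLIP's text transformer. The placeholder convention and helper name are hypothetical; the paper's exact wiring may differ.

```python
import torch
import torch.nn as nn

def build_multimodal_text_embeds(token_embedding: nn.Embedding,
                                 token_ids: torch.Tensor,
                                 pseudo_token: torch.Tensor,
                                 placeholder_id: int) -> torch.Tensor:
    """Splice an image-derived pseudo-word into a tokenized caption.

    Assumes exactly one placeholder token per caption.
    token_ids:    (batch, seq_len) caption tokens containing placeholder_id
    pseudo_token: (batch, embed_dim) output of the textual inversion network
    """
    embeds = token_embedding(token_ids)           # (batch, seq_len, embed_dim)
    mask = token_ids == placeholder_id            # (batch, seq_len) boolean
    embeds[mask] = pseudo_token.to(embeds.dtype)  # one replacement per row
    return embeds  # fed to CLIP's text transformer in place of raw embeddings
```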
Figure 2: Overview of the proposed approach. We disentangle the CLIP common embedding space via linear projections. We employ textual inversion to make the textual representation multimodal. We fuse the textual and visual features with a Combiner architecture.
Disentangling Embedding Spaces
The model employs linear projections to adapt the pre-trained CLIP embedding spaces, allowing separate, task-oriented tuning of image and text features. A two-stage training strategy is used to optimize these projections, ensuring that multimodal representations are effectively disentangled.
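A minimal sketch of the adaptation step follows: two learned linear maps that project the frozen CLIP image and text features into separate, task-oriented spaces. The dimensions and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearAdapters(nn.Module):
    """Per-modality linear projections over frozen CLIP features."""

    def __init__(self, clip_dim: int = 768, proj_dim: int = 1024):
        super().__init__()
        # Separate projections let each modality specialize for the task
        # instead of staying tied to CLIP's shared contrastive space.
        self.image_proj = nn.Linear(clip_dim, proj_dim)
        self.text_proj = nn.Linear(clip_dim, proj_dim)

    def forward(self, image_feats: torch.Tensor,
                text_feats: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return self.image_proj(image_feats), self.text_proj(text_feats)
```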
Multimodal Fusion Network
The Combiner network is used to fuse the multimodal representations. It takes the adapted visual and textual embeddings and integrates them into a single joint representation. This fusion is critical for accurately capturing the nuanced interactions between meme text and imagery, facilitating more effective classification.
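Below is a hedged sketch of a Combiner-style fusion module in the spirit of this architecture: each modality is projected, the projections are concatenated and mixed by an MLP, and a learned scalar gate mixes the raw features back in. Exact widths, dropout, and the downstream classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Combiner(nn.Module):
    """Fuses adapted text and image features into one representation."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.text_branch = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5))
        self.image_branch = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5))
        self.mixer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden_dim, feat_dim))
        # Scalar gate deciding how much raw text vs. image signal to keep.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor,
                image_feats: torch.Tensor) -> torch.Tensor:
        t = self.text_branch(text_feats)
        i = self.image_branch(image_feats)
        cat = torch.cat([t, i], dim=-1)
        alpha = self.gate(cat)  # (batch, 1), broadcast over feat_dim
        fused = self.mixer(cat) + alpha * text_feats + (1 - alpha) * image_feats
        return fused  # passed to a hateful / not-hateful classification head
```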
Experimental Results
The experiments demonstrate that ISSUES exceeds current state-of-the-art performance on the Hateful Memes Challenge and HarMeme datasets. In particular, ISSUES outperforms the widely recognized Hate-CLIPper model, illustrating its superior capability in handling the intricacies of multimodal meme classification.
Quantitative results on the HMC test set:
| Method | Accuracy (%) | AUROC (%) |
| --- | --- | --- |
| CLIP Text-Only | 63.50 | 63.43 |
| CLIP Image-Only | 74.65 | 81.35 |
| ISSUES | 77.70 | 85.51 |
The results from the HarMeme dataset further underscore the model's generalizability, achieving strong performance in real-world, uncurated meme scenarios.
Conclusion
ISSUES presents a robust framework for classifying hateful memes by enhancing the multimodal capabilities of the CLIP model through textual inversion and effective embedding adaptation. Its state-of-the-art performance on challenging datasets confirms the potential of leveraging powerful vision-language models and advanced fusion strategies for complex classification tasks. Future research may explore extending these methods to broader multimodal content understanding and further refining model efficiency and deployment capability.