Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection (2407.21004v3)
Abstract: Recent advances show that two-stream approaches achieve outstanding performance in hateful meme detection. However, hateful memes constantly evolve: new memes emerge by fusing progressive cultural ideas, rendering existing methods obsolete or ineffective. In this work, we explore the potential of Large Multimodal Models (LMMs) for hateful meme detection. To this end, we propose Evolver, which incorporates LMMs via Chain-of-Evolution (CoE) prompting, integrating the evolution attributes and in-context information of memes. Specifically, Evolver simulates the evolution and expression of memes and reasons through LMMs step by step. First, an evolutionary pair mining module retrieves the top-k memes most similar to the input meme from an external curated meme set. Second, an evolutionary information extractor summarizes the semantic regularities between the paired memes for prompting. Finally, a contextual relevance amplifier enhances the in-context hatefulness information to guide the search for evolutionary cues. Extensive experiments on the public FHM, MAMI, and HarM datasets show that CoE prompting can be incorporated into existing LMMs to improve their performance. More encouragingly, it can serve as an interpretive tool for understanding the evolution of social memes.
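The three-step pipeline in the abstract can be sketched in code. This is a minimal, illustrative mock-up, not the paper's implementation: the toy embeddings, tag-overlap "regularity" summary, function names, and prompt wording are all assumptions standing in for real meme encoders and an actual LMM call.

```python
# Hedged sketch of Chain-of-Evolution (CoE) prompting. All names, the toy
# embeddings, and the prompt template are illustrative assumptions.
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_evolutionary_pairs(query_emb, meme_set, k=2):
    """Step 1: retrieve the top-k memes most similar to the input meme
    from the external curated meme set."""
    ranked = sorted(meme_set, key=lambda m: cosine(query_emb, m["emb"]),
                    reverse=True)
    return ranked[:k]

def extract_evolution_info(pairs):
    """Step 2: summarize semantic regularities shared by the retrieved
    memes (here crudely approximated as the overlap of descriptive tags)."""
    common = set(pairs[0]["tags"])
    for m in pairs[1:]:
        common &= set(m["tags"])
    return sorted(common)

def build_coe_prompt(meme_text, regularities, context_hint):
    """Step 3: amplify in-context hatefulness cues and assemble the
    step-by-step reasoning prompt that would be fed to an LMM."""
    return (
        f"Meme text: {meme_text}\n"
        f"Regularities shared with similar memes: {', '.join(regularities)}\n"
        f"Context cue: {context_hint}\n"
        "Reason step by step about how this meme may have evolved, "
        "then answer: is it hateful? (yes/no)"
    )

# Toy curated meme set with fabricated embeddings and tags.
meme_set = [
    {"text": "meme A", "emb": [1.0, 0.1, 0.0], "tags": ["mock", "group-x"]},
    {"text": "meme B", "emb": [0.9, 0.2, 0.1], "tags": ["mock", "group-x", "slur"]},
    {"text": "meme C", "emb": [0.0, 1.0, 0.9], "tags": ["wholesome"]},
]

query = {"text": "new meme", "emb": [1.0, 0.0, 0.05]}
pairs = mine_evolutionary_pairs(query["emb"], meme_set, k=2)
regularities = extract_evolution_info(pairs)
prompt = build_coe_prompt(query["text"], regularities, "targets group-x")
print(prompt)
```

In the full method the retrieval step would use learned multimodal embeddings (e.g. from a vision-language encoder) and the final prompt would be sent to the LMM; the sketch only shows how the three modules compose.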