BCAmirs at SemEval-2024 Task 4: Beyond Words: A Multimodal and Multilingual Exploration of Persuasion in Memes (2404.03022v2)

Published 3 Apr 2024 in cs.CL, cs.CV, cs.IT, cs.LG, and math.IT

Abstract: Memes, combining text and images, frequently use metaphors to convey persuasive messages, shaping public opinion. Motivated by this, our team engaged in SemEval-2024 Task 4, a hierarchical multi-label classification task designed to identify rhetorical and psychological persuasion techniques embedded within memes. To tackle this problem, we introduced a caption generation step to assess the modality gap and the impact of additional semantic information from images, which improved our result. Our best model utilizes GPT-4 generated captions alongside meme text to fine-tune RoBERTa as the text encoder and CLIP as the image encoder. It outperforms the baseline by a large margin in all 12 subtasks. In particular, it ranked in top-3 across all languages in Subtask 2a, and top-4 in Subtask 2b, demonstrating quantitatively strong performance. The improvement achieved by the introduced intermediate step is likely attributable to the metaphorical essence of images that challenges visual encoders. This highlights the potential for improving abstract visual semantics encoding.

References (35)
  1. LM-CPPF: Paraphrasing-guided data augmentation for contrastive prompt-based few-shot fine-tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 670–681, Toronto, Canada. Association for Computational Linguistics.
  2. LION: Empowering multimodal large language model with dual-level visual knowledge. arXiv preprint arXiv:2311.11860.
  3. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
  4. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  5. SemEval-2024 task 4: Multilingual detection of persuasion techniques in memes. In Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico.
  6. SemEval-2021 task 6: Detection of persuasion techniques in texts and images. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 70–98, Online. Association for Computational Linguistics.
  7. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.
  8. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  9. EunJeong Hwang and Vered Shwartz. 2023. MemeCap: A dataset for captioning and interpreting memes. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1433–1445, Singapore. Association for Computational Linguistics.
  10. BRAINTEASER: Lateral thinking puzzles for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14317–14332, Singapore. Association for Computational Linguistics.
  11. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  12. Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI 2006, Québec City, Québec, Canada, June 7-9, 2006. Proceedings 19, pages 395–406. Springer.
  13. Anushka Kulkarni. 2017. Internet meme and political discourse: A study on the impact of internet meme as a tool in communicating political satire. Journal of Content, Community & Communication Amity School of Communication, 6.
  14. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
  15. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  16. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  17. Visual instruction tuning. In NeurIPS.
  18. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  19. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
  20. AIMH at SemEval-2021 task 6: Multimodal classification using an ensemble of transformer models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 1020–1026, Online. Association for Computational Linguistics.
  21. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582.
  22. GPT-4 technical report.
  23. Seokmok Park and Joonki Paik. 2023. RefCap: Image captioning with referent objects attributes. Scientific Reports, 13(1):21577.
  24. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
  25. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  26. Targeted adversarial attacks against neural machine translation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
  27. Comparing encoder-only and encoder-decoder transformers for relation extraction from biomedical texts: An empirical study on ten benchmark datasets. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 376–382, Dublin, Ireland. Association for Computational Linguistics.
  28. MMF: A multimodal framework for vision and language research. https://github.com/facebookresearch/mmf.
  29. Text classification via large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8990–9005, Singapore. Association for Computational Linguistics.
  30. Attention is all you need. Advances in neural information processing systems, 30.
  31. Ben Wasike. 2022. Memes, memes, everywhere, nor any meme to trust: Examining the credibility and persuasiveness of covid-19-related memes. Journal of Computer-Mediated Communication, 27(2):zmab024.
  32. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  33. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
  34. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36.
  35. ChatBridge: Bridging modalities with large language model as a language catalyst. arXiv preprint arXiv:2305.16103.

Summary

  • The paper proposes a novel intermediate caption generation step using GPT-4 to integrate textual and visual features effectively.
  • The study compares models like RoBERTa and LLaVA-1.5 to achieve robust, multilingual classification of persuasion techniques in memes.
  • The results demonstrate that generated captions significantly reduce modality gaps and improve detection of nuanced rhetorical strategies.

Multimodal and Multilingual Exploration of Persuasion Techniques in Memes

Introduction

Memes have emerged as a potent form of communication, particularly in shaping public opinion through persuasion techniques. In SemEval-2024 Task 4, the BCAmirs team tackles a hierarchical multi-label classification problem: identifying the rhetorical and psychological persuasion techniques embedded in memes. Their approach introduces a novel meme caption generation step to bridge the modality gap between textual and visual information, substantially improving performance across all 12 subtasks. Since the task is scored hierarchically, the sketch below illustrates what that means in practice.
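
In hierarchical multi-label evaluation, in the spirit of the class-hierarchy metrics of Kiritchenko et al. (reference 12), predicted and gold label sets are expanded with all their ancestors in the technique hierarchy before computing micro-averaged F1. The following is a minimal sketch; the tiny PARENT map is a hypothetical fragment, not the task's actual hierarchy.

```python
# Hierarchical micro-F1 sketch: expand labels with ancestors, then micro-F1.
PARENT = {  # child -> parent; hypothetical fragment of the hierarchy
    "Name calling": "Ad Hominem",
    "Ad Hominem": "Ethos",
}

def with_ancestors(labels: set[str]) -> set[str]:
    """Add every ancestor of each label to the set."""
    expanded = set(labels)
    for label in labels:
        while label in PARENT:
            label = PARENT[label]
            expanded.add(label)
    return expanded

def hierarchical_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over ancestor-expanded label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = with_ancestors(g), with_ancestors(p)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Under this scheme, predicting a correct parent technique earns partial credit even when the exact leaf label is missed.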

Background on Modality Gap and Persuasion Technique Classification

The disparity between visual and textual modalities, or the modality gap, has been a focal point of recent studies aiming to enhance multimodal LLMs (MLLMs). Works such as ChatBridge and LION have pushed the boundaries of modality bridging, demonstrating advances across multimodal tasks. Similarly, prior studies of persuasion techniques in memes underscore the importance of detecting nuanced rhetorical strategies within multimodal content.

Methodology

BCAmirs' methodology centers on an intermediate caption generation step that uses models such as GPT-4 to supply additional semantic information about the image. The approach compares various models, including language representation models (LRMs) like RoBERTa and MLLMs such as LLaVA-1.5, shedding light on how different combinations of textual and visual features influence the detection of persuasion techniques.

Caption Generation

The first step uses models such as GPT-4 and LLaVA-1.5 to caption each meme. The generated captions are intended to encapsulate the essence of the meme, covering both its textual content and its metaphorical imagery. This step plays a critical role in bridging the gap between the textual and visual modalities, supplying richer input for the downstream classification task.
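
A minimal sketch of such a captioning call is shown below, assuming the OpenAI Python client (>= 1.0) and a vision-capable GPT-4 model; the model name and prompt wording here are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch of GPT-4-based meme captioning (model name and prompt are
# assumptions; the paper's exact prompting setup may differ).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_meme_caption(image_path: str, meme_text: str) -> str:
    """Ask a vision-capable GPT-4 model to describe the meme's imagery."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable GPT-4 variant
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Describe this meme's image and its metaphorical "
                          f"meaning. The overlaid text reads: {meme_text}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=128,
    )
    return response.choices[0].message.content
```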

Persuasion Technique Classification

Following caption generation, various models are employed to classify persuasion techniques, examining the effect of incorporating generated captions alongside meme text and images. The BCAmirs team explores configurations using only text, text plus images, and text with generated captions, among others, to quantify the modality gap and the value of the additional semantic information.
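
For the text-based configurations, a minimal sketch of multi-label fine-tuning with a Hugging Face RoBERTa checkpoint follows; the roberta-base checkpoint, the label count, and feeding meme text and caption as a sentence pair are assumptions for illustration, and actual fine-tuning on the task data is omitted.

```python
# Hedged sketch of multi-label persuasion-technique classification with
# RoBERTa; checkpoint, label count, and input pairing are assumptions.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

NUM_TECHNIQUES = 22  # assumption: size of the task's persuasion-label set

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=NUM_TECHNIQUES,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss
)

def classify(meme_text: str, caption: str, threshold: float = 0.5):
    """Return indices of predicted techniques (model assumed fine-tuned)."""
    # Feed meme text and the generated caption as a sentence pair.
    inputs = tokenizer(meme_text, caption, return_tensors="pt",
                       truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return (probs > threshold).nonzero(as_tuple=True)[0].tolist()
```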

Experiments and Results

The team's experiments highlight the effectiveness of the method, particularly with the ConcatRoBERTa model, which combines meme text, the image, and GPT-4-generated captions. This approach outperforms the baselines and performs robustly across languages in the hierarchical classification tasks. The results suggest that the additional semantic information in the captions, especially those generated by GPT-4, significantly aids classification.
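
A concatenation-based fusion model in the spirit of ConcatRoBERTa might look like the following sketch; the checkpoints, pooling choices, and head architecture are assumptions, not the authors' released implementation.

```python
# Hedged sketch of ConcatRoBERTa-style fusion: RoBERTa encodes meme text
# plus the generated caption, CLIP encodes the image, and the pooled
# features are concatenated before a multi-label head.
import torch
import torch.nn as nn
from transformers import RobertaModel, CLIPVisionModel

class ConcatFusionClassifier(nn.Module):
    def __init__(self, num_labels: int = 22):  # label count is an assumption
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-base-patch32")
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.image_encoder.config.hidden_size)  # 768 + 768
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, num_labels),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]            # <s> token embedding
        image_feat = self.image_encoder(
            pixel_values=pixel_values
        ).pooler_output                      # CLIP vision pooled output
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))
```

Keeping the two encoders separate and fusing only at the feature level lets the text branch absorb the generated caption, which is where the reported gains over image-only encoding come from.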

Conclusion and Future Directions

The BCAmirs team's work offers a novel perspective on classifying persuasion techniques in memes, leveraging the power of generated captions to enhance multimodal classification. Their findings suggest that addressing the modality gap through caption generation can improve the detection of nuanced persuasion techniques. Future research avenues include exploring more advanced models for caption generation and extending the analysis to a broader range of low-resource languages.

This exploration into persuasion techniques within memes signifies a step forward in understanding and mitigating the impact of disinformation campaigns. By enhancing the classification of memes, the research opens doors to more effective tools for recognizing and countering misleading content across social platforms.
