GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse (2401.01523v3)

Published 3 Jan 2024 in cs.CL and cs.AI

Abstract: The exponential growth of social media has profoundly transformed how information is created, disseminated, and absorbed, exceeding any precedent in the digital age. Regrettably, this explosion has also spawned a significant increase in online abuse conducted through memes. Evaluating the negative impact of memes is notably challenging, owing to their often subtle and implicit meanings, which are not directly conveyed through the overt text and imagery. In light of this, large multimodal models (LMMs) have emerged as a focal point of interest due to their remarkable capabilities in handling diverse multimodal tasks. In response to this development, our paper aims to thoroughly examine the capacity of various LMMs (e.g., GPT-4V) to discern and respond to the nuanced aspects of social abuse manifested in memes. We introduce GOAT-Bench, a comprehensive meme benchmark comprising over 6K varied memes spanning themes such as implicit hate speech, sexism, and cyberbullying. Utilizing GOAT-Bench, we delve into the ability of LMMs to accurately assess hatefulness, misogyny, offensiveness, sarcasm, and harmful content. Our extensive experiments across a range of LMMs reveal that current models still exhibit a deficiency in safety awareness, showing insensitivity to various forms of implicit abuse. We posit that this shortfall represents a critical impediment to the realization of safe artificial intelligence. GOAT-Bench and its accompanying resources are publicly accessible at https://goatlmm.github.io/, contributing to ongoing research in this vital field.
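
The abstract frames GOAT-Bench as a set of binary safety judgments (hatefulness, misogyny, offensiveness, sarcasm, harmfulness) posed over image-plus-text memes. As a rough illustration of how such a zero-shot evaluation might be wired up, the sketch below asks one yes/no question per meme and scores accuracy. The file name, JSON field names, and the query_lmm() stub are assumptions made for illustration; they are not the benchmark's actual schema or the paper's evaluation code.

```python
# Hedged sketch of a zero-shot, GOAT-Bench-style evaluation loop.
# Assumed (illustrative) schema: a JSON list of samples with fields
# "image_path", "text" (overlaid meme text), "task" (e.g., "hateful"),
# and "label" (1 = abusive, 0 = benign).
import base64
import json
from pathlib import Path


def encode_image(path: str) -> str:
    """Base64-encode a meme image so it can be passed to a vision-language model API."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


def query_lmm(image_b64: str, prompt: str) -> str:
    """Placeholder for a call to the LMM under test (e.g., GPT-4V, LLaVA, MiniGPT-4).

    Replace this stub with a real API or local-inference call that accepts an
    image and a text prompt and returns the model's free-form answer.
    """
    raise NotImplementedError("Wire this up to your model of choice.")


def evaluate(benchmark_file: str = "goat_bench_samples.json") -> float:
    """Ask a binary safety question for each meme and report overall accuracy."""
    samples = json.loads(Path(benchmark_file).read_text())
    correct = 0
    for sample in samples:
        prompt = (
            f'The meme contains the overlaid text: "{sample["text"]}".\n'
            f'Question: is this meme {sample["task"]}? Answer strictly "yes" or "no".'
        )
        answer = query_lmm(encode_image(sample["image_path"]), prompt)
        # Map the model's free-form answer onto the binary label space.
        predicted = 1 if answer.strip().lower().startswith("yes") else 0
        correct += int(predicted == sample["label"])
    return correct / len(samples)


if __name__ == "__main__":
    print(f"Accuracy: {evaluate():.3f}")
```

The paper's actual evaluation additionally covers implicit and nuanced abuse, so a real harness would likely vary the prompt per task and record per-task metrics rather than a single accuracy figure.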

Authors (5)
  1. Hongzhan Lin (33 papers)
  2. Ziyang Luo (35 papers)
  3. Bo Wang (823 papers)
  4. Ruichao Yang (9 papers)
  5. Jing Ma (136 papers)
Citations (19)
