
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms (2402.14154v3)

Published 21 Feb 2024 in cs.CL, cs.CV, and cs.CY

Abstract: Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehend the information or emotions associated with interactions in online spaces. Multimodal LLMs (MLLMs) have emerged as a promising solution to these challenges, yet they struggle to accurately interpret human emotions and complex content such as misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, targeting a range of tasks including misinformation detection, hate speech detection, and social context generation. Through an exhaustive evaluation of ten size variants of four open-source MLLMs, we identify significant performance disparities, highlighting the need for advancements in models' social understanding capabilities. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulty handling social media tasks. However, MLLMs demonstrate performance improvements after fine-tuning, suggesting potential pathways for improvement. Our code and data are available at https://github.com/claws-lab/MMSoc.git.
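The zero-shot protocol described in the abstract can be sketched as follows: each benchmark task is framed as an instruction over an (image, text) post, and the model's free-form answer is mapped back to a task label before scoring. This is a minimal illustrative sketch, not the MM-Soc codebase; the model interface `query_mllm` and the field names are assumptions, and the model call is left as an injected callable so any MLLM backend can be plugged in.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str   # path to the post's image
    text: str         # accompanying post text
    label: str        # gold label, e.g. "misinformation" or "legitimate"

def build_prompt(task: str, ex: Example) -> str:
    """Frame a classification task as a zero-shot instruction."""
    return (
        f"Task: {task}\n"
        f"Post text: {ex.text}\n"
        "Answer with exactly one word: yes or no."
    )

def zero_shot_accuracy(task, examples, query_mllm):
    """Score a model (any callable: (prompt, image_path) -> str) on one task.

    The model's free-form answer is normalized and mapped to a label,
    then compared against the gold label.
    """
    correct = 0
    for ex in examples:
        answer = query_mllm(build_prompt(task, ex), ex.image_path).strip().lower()
        predicted = "misinformation" if answer.startswith("yes") else "legitimate"
        correct += predicted == ex.label
    return correct / len(examples)
```

Because the model is an injected callable, the same loop serves both zero-shot and fine-tuned checkpoints, which matches the before/after fine-tuning comparison the abstract reports.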

Authors (5)
  1. Yiqiao Jin
  2. Minje Choi
  3. Gaurav Verma
  4. Jindong Wang
  5. Srijan Kumar
Citations (13)