
DocMSU: A Comprehensive Benchmark for Document-level Multimodal Sarcasm Understanding (2312.16023v1)

Published 26 Dec 2023 in cs.CL and cs.MM

Abstract: Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field such as public opinion analysis and forgery detection. However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. In document-level news, sarcasm clues are sparse or small and are often concealed in long text. Moreover, compared to sentence-level comments like tweets, which mainly focus on only a few trends or hot topics (e.g., sports events), content in the news is considerably diverse. Models created for sentence-level MSU may fail to capture sarcasm clues in document-level news. To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). Our dataset contains 102,588 pieces of news with text-image pairs, covering 9 diverse topics such as health, business, etc. The proposed large-scale and diverse DocMSU significantly facilitates the research of document-level MSU in real-world scenarios. To take on the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method to properly align the pixel-level image features with word-level textual features in documents. Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU. Our code and dataset are available at https://github.com/Dulpy/DocMSU.
