DocMSU: A Comprehensive Benchmark for Document-level Multimodal Sarcasm Understanding (2312.16023v1)
Abstract: Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field such as public opinion analysis and forgery detection. However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. In document-level news, sarcasm clues are sparse or small and are often concealed in long text. Moreover, compared to sentence-level comments like tweets, which mainly focus on only a few trends or hot topics (e.g., sports events), content in the news is considerably diverse. Models created for sentence-level MSU may fail to capture sarcasm clues in document-level news. To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). Our dataset contains 102,588 pieces of news with text-image pairs, covering 9 diverse topics such as health and business. The proposed large-scale and diverse DocMSU significantly facilitates the research of document-level MSU in real-world scenarios. To take on the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method to properly align the pixel-level image features with word-level textual features in documents. Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU. Our code and dataset are available at https://github.com/Dulpy/DocMSU.