SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection (2403.03170v1)
Abstract: Misinformation is a prevalent societal issue due to its potentially high risks. Out-of-context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which are essential for debunking misinformation. While multimodal LLMs (MLLMs) have rich knowledge and an innate capability for visual reasoning and explanation generation, they still lack sophistication in understanding and discovering subtle cross-modal differences. In this paper, we introduce SNIFFER, a novel multimodal LLM specifically engineered for OOC misinformation detection and explanation. SNIFFER employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities, and the second stage uses OOC-specific instruction data generated by language-only GPT-4 to fine-tune the model's discriminative ability. Enhanced by external tools and retrieval, SNIFFER not only detects inconsistencies between text and image but also uses external knowledge for contextual verification. Our experiments show that SNIFFER surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. SNIFFER also provides accurate and persuasive explanations, as validated by quantitative and human evaluations.
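The abstract describes two complementary signals: an internal check of image-text consistency and an external check against retrieved context. The sketch below is only a toy illustration of how two such verdicts could be merged into one judgment with an explanation. It is not SNIFFER's implementation: the names `Verdict`, `internal_check`, `external_check`, and `combine` are hypothetical, entity sets stand in for what a model or retrieval tool would produce, and the "flag if either check disagrees" rule is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Verdict:
    is_ooc: bool        # True if the check considers the pair out-of-context
    explanation: str    # short human-readable reason


def internal_check(caption_entities: Set[str], image_entities: Set[str]) -> Verdict:
    # Hypothetical stand-in for an image-text consistency check:
    # flag caption entities that are not visible in the image.
    missing = sorted(caption_entities - image_entities)
    if missing:
        return Verdict(True, f"caption mentions {missing} not seen in the image")
    return Verdict(False, "caption entities match the image")


def external_check(caption_entities: Set[str], evidence_entities: Set[str]) -> Verdict:
    # Hypothetical stand-in for contextual verification against retrieved
    # evidence (for example, text found by reverse image search).
    if not evidence_entities:
        return Verdict(False, "no external evidence retrieved")
    if caption_entities & evidence_entities:
        return Verdict(False, "retrieved context supports the caption")
    return Verdict(True, "retrieved context does not mention any caption entity")


def combine(internal: Verdict, external: Verdict) -> Verdict:
    # Assumed toy rule: flag the pair if either check reports a problem,
    # and keep both reasons so the final judgment stays explainable.
    is_ooc = internal.is_ooc or external.is_ooc
    explanation = "; ".join(v.explanation for v in (internal, external))
    return Verdict(is_ooc, explanation)


# Toy example: the caption names a place that the image does not show.
caption = {"obama", "berlin"}
image = {"obama", "podium"}
evidence = {"obama", "washington"}
result = combine(internal_check(caption, image), external_check(caption, evidence))
print(result.is_ooc, "-", result.explanation)
```

In this toy run the internal check flags "berlin" while the external check finds supporting overlap, so the combined verdict is out-of-context and the explanation carries both reasons. A real system would replace the set comparisons with model outputs and retrieved documents.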
Authors: Peng Qi, Zehong Yan, Wynne Hsu, Mong Li Lee