Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability (2402.11159v3)
Abstract: This paper addresses the challenge of assessing the representativeness of news thumbnail images, which are often the first visual content readers encounter when an article is shared on social media. We focus on whether a news image represents the actors discussed in the news text. To address this challenge, we introduce NewsTT, a manually annotated dataset of 1,000 news thumbnail image and text pairs. We found that pretrained vision and language models, such as BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, pretrained models may have a limited capability to match the visual and textual appearances of news actors. We hypothesize that learning to contrast news text with its counterfactual, in which named entities are replaced, can enhance the cross-modal matching ability of vision and language models. We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to this hypothesis. We found that this simple method improves performance in assessing news thumbnail representativeness, supporting our hypothesis. Code and data can be accessed at https://github.com/ssu-humane/news-images-acl24.
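The hypothesis in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes an InfoNCE-style objective in which each image is contrasted against its own text (positive), the other texts in the batch, and a counterfactual version of its own text (hard negative). The function names, the dictionary-based entity replacement, the temperature value, and the toy embeddings are all hypothetical and chosen only for illustration; the actual CFT-CLIP procedure is in the linked repository.

```python
import math


def make_counterfactual(text, replacements):
    """Build a counterfactual text by swapping named entities.

    `replacements` maps an entity string to a substitute string. This lookup
    is an assumption made for illustration only.
    """
    for entity, substitute in replacements.items():
        text = text.replace(entity, substitute)
    return text


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def counterfactual_contrastive_loss(image_embs, text_embs, cf_text_embs,
                                    temperature=0.05):
    """Mean InfoNCE-style loss over a batch of embedding vectors.

    For image i, text i is the positive. The negatives are every other text
    in the batch plus the counterfactual of text i (a hard negative).
    """
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    cf_texts = [_normalize(v) for v in cf_text_embs]

    total = 0.0
    for i, image in enumerate(images):
        # Cosine similarities scaled by the temperature.
        logits = [_dot(image, t) / temperature for t in texts]
        logits.append(_dot(image, cf_texts[i]) / temperature)
        # Numerically stable log-sum-exp over positive and negatives.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(images)


if __name__ == "__main__":
    original = "Joe Biden met Emmanuel Macron in Paris."
    print(make_counterfactual(original, {"Joe Biden": "Person A"}))

    # Toy 2-d embeddings: each image aligns with its own text.
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    counterfactuals = [[0.0, 1.0], [1.0, 0.0]]
    print(counterfactual_contrastive_loss(images, texts, counterfactuals))
```

In this sketch the loss is small when an image embedding sits closer to its original text than to the counterfactual, and large when the two are swapped, which is the behavior the hypothesis relies on.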