Exposing Text-Image Inconsistency Using Diffusion Models (2404.18033v1)
Abstract: In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets, act as "omniscient" agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIIL enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency, providing a robust framework for future research combating misinformation.