NewsCaption: Named-Entity aware Captioning for Out-of-Context Media (2403.12618v1)
Abstract: With the increasing influence of social media, online misinformation has grown into a societal issue. The motivation for our work comes from the threat posed by cheapfakes, where an unaltered image is described using a news caption in a new but false context. The main challenge in detecting such out-of-context multimedia is the unavailability of large-scale datasets. Several detection methods employ randomly selected captions to generate out-of-context training inputs. However, these randomly matched captions are not truly representative of out-of-context scenarios because of inconsistencies between the image description and the matched caption. We aim to address these limitations by introducing a novel task of out-of-context caption generation. In this work, we propose a new method that generates a realistic out-of-context caption given visual and textual context. We also demonstrate that the semantics of the generated captions can be controlled using the textual context. Finally, we evaluate our method against several baselines; it improves over the image captioning baseline by 6.2% BLEU-4, 2.96% CIDEr, 11.5% ROUGE, and 7.3% METEOR.
Authors: Anurag Singh, Shivangi Aneja