Claim Detection for Automated Fact-checking: A Survey on Monolingual, Multilingual and Cross-Lingual Research (2401.11969v3)
Abstract: Automated fact-checking has drawn considerable attention over the past few decades due to the growing diffusion of misinformation on online platforms. It is often carried out as a sequence of tasks comprising (i) the detection of sentences circulating on online platforms that constitute claims needing verification, followed by (ii) the verification of those claims. This survey focuses on the former, discussing existing efforts towards detecting claims that need fact-checking, with a particular focus on multilingual data and methods. This is a challenging and fertile direction in which existing methods are still far from matching human performance, owing to the profoundly difficult nature of the problem. In particular, the dissemination of information across multiple social platforms, in multiple languages and modalities, demands more generalized solutions for combating misinformation. Focusing on multilingual misinformation, we present a comprehensive survey of existing multilingual claim detection research, categorizing state-of-the-art work according to three key factors of the problem: verifiability, priority, and similarity. Further, we present a detailed overview of existing multilingual datasets, along with open challenges, and suggest possible future advancements.