
Multilingual Previously Fact-Checked Claim Retrieval (2305.07991v2)

Published 13 May 2023 in cs.CL

Abstract: Fact-checkers are often hampered by the sheer amount of online content that needs to be fact-checked. NLP can help them by retrieving already existing fact-checks relevant to the content being investigated. This paper introduces a new multilingual dataset -- MultiClaim -- for previously fact-checked claim retrieval. We collected 28k posts in 27 languages from social media, 206k fact-checks in 39 languages written by professional fact-checkers, as well as 31k connections between these two groups. This is the most extensive and the most linguistically diverse dataset of this kind to date. We evaluated how different unsupervised methods fare on this dataset and its various dimensions. We show that evaluating such a diverse dataset has its complexities and proper care needs to be taken before interpreting the results. We also evaluated a supervised fine-tuning approach, improving upon the unsupervised method significantly.
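To make the retrieval task concrete, the following is a minimal sketch of ranking fact-checks for a given post. It is not the paper's method: the bag-of-words cosine similarity, the function names (`tokenize`, `cosine`, `rank_fact_checks`), and the toy English examples are all assumptions chosen for illustration. A real multilingual system would need a multilingual text representation instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative only)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_fact_checks(post, fact_checks):
    """Return fact-check indices ordered from most to least similar to the post.

    Ties are broken by the original index so the ordering is deterministic.
    """
    post_vec = Counter(tokenize(post))
    scored = [
        (cosine(post_vec, Counter(tokenize(fc))), i)
        for i, fc in enumerate(fact_checks)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


# Hypothetical toy data, not taken from the MultiClaim dataset.
fact_checks = [
    "Vaccines do not contain microchips",
    "The moon landing was not staged",
    "Drinking bleach does not cure the flu",
]
post = "They put microchips in vaccines!"
ranking = rank_fact_checks(post, fact_checks)
print(ranking[0])  # index of the best-matching fact-check
```

Word overlap of this kind only works when the post and the fact-check share a language and vocabulary, which is one reason a dataset spanning many languages is harder to evaluate on than a monolingual one.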

Authors (10)
  1. Matúš Pikuliak (12 papers)
  2. Ivan Srba (28 papers)
  3. Robert Moro (22 papers)
  4. Timo Hromadka (2 papers)
  5. Timotej Smolen (2 papers)
  6. Martin Melisek (1 paper)
  7. Ivan Vykopal (8 papers)
  8. Jakub Simko (18 papers)
  9. Juraj Podrouzek (3 papers)
  10. Maria Bielikova (27 papers)