
The Automated Verification of Textual Claims (AVeriTeC) Shared Task (2410.23850v1)

Published 31 Oct 2024 in cs.CL

Abstract: The Automated Verification of Textual Claims (AVeriTeC) shared task asks participants to retrieve evidence and predict veracity for real-world claims checked by fact-checkers. Evidence can be found either via a search engine, or via a knowledge store provided by the organisers. Submissions are evaluated using AVeriTeC score, which considers a claim to be accurately verified if and only if both the verdict is correct and retrieved evidence is considered to meet a certain quality threshold. The shared task received 21 submissions, 18 of which surpassed our baseline. The winning team was TUDA_MAI with an AVeriTeC score of 63%. In this paper we describe the shared task, present the full results, and highlight key takeaways from the shared task.


Summary

  • The paper introduces AVeriTeC, a novel shared task using a comprehensive real-world claims dataset for automated verification.
  • The baseline employs a multi-step pipeline combining document parsing, question generation, and BERT-based veracity prediction over retrieved evidence.
  • Evaluation shows that 18 of 21 submissions surpassed the baseline, with the winning system achieving a 63% AVeriTeC score.

Overview of AVeriTeC Shared Task for Automated Verification of Textual Claims

The paper "The Automated Verification of Textual Claims (AVeriTeC) Shared Task" describes a shared task designed to advance automated fact-checking (AFC). Participants must retrieve evidence and predict the veracity of real-world claims previously checked by professional fact-checkers. In doing so, the task addresses deficiencies of earlier datasets, aiming to provide a more comprehensive and representative benchmark for evaluating AFC systems.

Key Contributions and Methodological Details

  1. Dataset and Methodology: AVeriTeC is built on a dataset that pairs real-world claims with evidence sourced from the web, structured as question generation and answering. Its construction is designed to avoid common problems in existing resources, such as sparse evidence annotation and reliance on artificial claims. The dataset comprises 4,568 claims, augmented for the shared task with a new test set of 1,215 claims.
  2. Baseline and Evaluation: The baseline system is adapted from prior work and is supplemented by an organiser-provided knowledge store, which reduces the cost of evidence retrieval. It proceeds in steps: document parsing, question generation using BLOOM, and veracity prediction with pretrained BERT models. Submissions were ranked by the AVeriTeC score, which counts a claim as correctly verified only if the verdict is right and the retrieved evidence clears a quality threshold; hedged sketches of the scoring logic and of the verdict-prediction step follow this list.
  3. Submissions and Results: Engagement was high: the task drew 21 submissions, 18 of which surpassed the baseline. Approaches varied considerably; the winning team, TUDA_MAI, reached an AVeriTeC score of 63%, and runners-up such as HUMANE also improved markedly on the baseline.
  4. Human Evaluation and Challenges: The paper analyses human evaluations of system outputs to gauge the reliability of automated judgments. It highlights issues such as scraper failures that degrade knowledge store content, along with the need for better alignment between automated metrics and human judgments.
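
To make the evaluation criterion concrete, the following is a minimal sketch of the AVeriTeC scoring logic: predicted question-answer evidence is aligned to gold evidence with the Hungarian algorithm over pairwise METEOR scores, and a claim only counts as verified when the evidence score clears a threshold (0.25 in the original dataset paper) and the verdict is correct. The data layout and normalisation here are assumptions for illustration; the organisers' official scorer may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from nltk.translate.meteor_score import meteor_score  # requires nltk.download("wordnet")


def evidence_score(pred_qas: list[str], gold_qas: list[str]) -> float:
    """Hungarian-matched METEOR between predicted and gold QA evidence strings."""
    cost = np.zeros((len(pred_qas), len(gold_qas)))
    for i, pred in enumerate(pred_qas):
        for j, gold in enumerate(gold_qas):
            # Negate similarity because linear_sum_assignment minimises cost
            cost[i, j] = -meteor_score([gold.split()], pred.split())
    rows, cols = linear_sum_assignment(cost)
    # Average matched similarity over the gold evidence set (an assumed normalisation)
    return -cost[rows, cols].sum() / len(gold_qas)


def averitec_score(predictions: list[dict], references: list[dict],
                   threshold: float = 0.25) -> float:
    """Fraction of claims whose verdict is correct AND whose evidence clears the threshold."""
    correct = 0
    for pred, ref in zip(predictions, references):
        ev = evidence_score(pred["evidence"], ref["evidence"])
        if ev >= threshold and pred["label"] == ref["label"]:
            correct += 1
    return correct / len(references)
```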

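A hedged sketch of the baseline's BERT-based verdict-prediction step follows: the claim is paired with its retrieved question-answer evidence and classified into the four AVeriTeC verdict labels. The checkpoint name and the untuned classification head are illustrative assumptions; the actual baseline fine-tunes its own model on AVeriTeC training data.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The four AVeriTeC verdict classes
LABELS = ["Supported", "Refuted", "Not Enough Evidence",
          "Conflicting Evidence/Cherrypicking"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # untuned head shown here; fine-tuning on claim-evidence pairs is required in practice


def predict_verdict(claim: str, evidence_qas: list[str]) -> str:
    # Encode the claim and its concatenated QA evidence as a sequence pair
    inputs = tokenizer(claim, " ".join(evidence_qas),
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```

Treating verdict prediction as sequence-pair classification lets the encoder attend jointly over claim and evidence, the standard recipe for NLI-style verification.
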
Implications and Future Directions

This paper contributes substantially to the domain of AFC, presenting a structured task that combines retrieval and reasoning, critical for real-world fact-checking applications. By focusing on real-world claims, AVeriTeC pushes the boundaries of current AFC methods, challenging researchers to enhance retrieval accuracy and reasoning capabilities.

The task highlights significant research challenges and directions for future investigation. Notably, improving the capabilities of smaller, resource-efficient models to match the performance of larger LLMs could democratize access to state-of-the-art fact-checking technologies. Additionally, further refinement of evaluation metrics to better align with human judgments is necessary to advance the reliability and validity of AFC systems.

Conclusion

AVeriTeC fosters notable progress in automated fact-checking by providing a robust, real-world benchmark at the intersection of information retrieval and natural language processing. The findings and methodologies from this shared task pave the way for more accurate and efficient fact-checking systems, with implications that extend beyond academic research into practical applications such as journalism and social media moderation. Substantial room for future development remains, positioning AVeriTeC as a cornerstone for the ongoing improvement of automated verification tools.
