Introduction
The proliferation of accessible online information, particularly in science and health domains, has made efficient fact-checking systems for scientific claims a necessity. Unlike typical setups that assume evidence documents are identified in advance, or that operate over a small, fixed collection of documents, the paper under review explores automated claim verification in a realistic open-domain setting. This evaluation takes a broader and more practical view of the verification process by testing performance against substantially larger knowledge bases.
Experiment Design
The paper keeps the pipeline for evidence sentence selection and verdict prediction fixed while varying the knowledge source and the document retrieval method. PubMed, Wikipedia, and Google serve as large-scale repositories of information, and their effectiveness for evidence retrieval is compared under two distinct methods: BM25 and semantic search using embeddings from BioSimCSE. The researchers ground their evaluation in the final verdict prediction scores, drawing on four datasets of biomedical and health claims whose veracity labels were assigned by domain experts.
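As a rough illustration of the two retrieval routes being compared, the sketch below scores a toy corpus against a claim with both a sparse BM25 index and dense sentence embeddings. The library choices (rank_bm25, sentence-transformers) and the specific BioSimCSE checkpoint name are assumptions for the example, not details taken from the paper.

```python
# Illustrative sketch: sparse (BM25) vs. dense (embedding) retrieval over a toy corpus.
# Library choices and the model checkpoint name are assumptions, not from the paper.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Vitamin C supplementation does not prevent the common cold in the general population.",
    "Statins reduce the risk of cardiovascular events in high-risk patients.",
    "Regular exercise is associated with lower all-cause mortality.",
]
claim = "Taking vitamin C stops you from catching colds."

# --- Sparse retrieval: BM25 over whitespace-tokenized documents ---
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(claim.lower().split())

# --- Dense retrieval: cosine similarity between sentence embeddings ---
# Hypothetical checkpoint name; any BioSimCSE-style sentence encoder would do here.
encoder = SentenceTransformer("kamalkraj/BioSimCSE-BioLinkBERT-BASE")
doc_emb = encoder.encode(corpus, convert_to_tensor=True)
claim_emb = encoder.encode(claim, convert_to_tensor=True)
dense_scores = util.cos_sim(claim_emb, doc_emb)[0]

# Rank documents under each method; downstream, the top-k sentences would feed
# the fixed evidence-selection and verdict-prediction stages.
print("BM25 ranking:", sorted(range(len(corpus)), key=lambda i: -bm25_scores[i]))
print("Dense ranking:", sorted(range(len(corpus)), key=lambda i: -float(dense_scores[i])))
```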
Results and Analysis
In the experiments, Wikipedia and semantic search typically achieve higher recall, showing that they can surface relevant evidence across a wide range of documents. BM25, in contrast, delivers higher precision, finding close matches without being over-inclusive, though at the expense of broader coverage. The findings are nuanced across claim types and sources: Wikipedia works better for popular health claims, while PubMed supports specialized medical questions more effectively. When "the whole web" is queried through Google, performance looks impressive, but details within the results, such as higher scores on datasets whose claims were drawn directly from PubMed, point to possible data leakage and to the limitations of snippet-based evidence.
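To make the precision/recall trade-off concrete, a small helper along these lines scores one claim's ranked retrieval list against a set of gold evidence documents. The metric definitions are standard; the function and document ids are purely illustrative, not code from the paper.

```python
# Illustrative retrieval metrics: precision@k and recall@k for a single claim.
def precision_recall_at_k(retrieved_ids, gold_ids, k):
    """retrieved_ids: ranked list of document ids; gold_ids: set of relevant ids."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in gold_ids)
    precision = hits / k
    recall = hits / len(gold_ids) if gold_ids else 0.0
    return precision, recall

# A high-recall retriever surfaces most gold documents somewhere in the top k,
# while a high-precision retriever keeps the top k free of irrelevant documents.
print(precision_recall_at_k(["d3", "d7", "d1", "d9"], {"d1", "d2"}, k=3))  # (0.333..., 0.5)
```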
Conclusion and Future Work
The research demonstrates both the potential and the challenges of open-domain scientific claim verification systems. The findings suggest that PubMed and Wikipedia can each serve as competent knowledge sources, with differences emerging from the nature of the claim. Dense retrieval is generally more effective than sparse retrieval, though each excels in particular scenarios. The analysis invites future work on modeling disagreement, assessing evidence quality, and integrating retrieval-augmented generation with LLMs to better support fact-checking. While the paper paves the way for robust AI-driven fact-checking, the authors also point out the limits on real-world applicability and the ethical considerations involved in handling sensitive health and medical misinformation.