The paper examines automated, open-domain verification of scientific claims against large knowledge bases.
PubMed, Wikipedia, and Google are compared as information sources for evidence retrieval using two methods: BM25 and semantic search.
The study finds that Wikipedia and semantic search generally achieve better recall, while BM25 delivers higher precision, with performance varying by claim type and source.
Future work on modeling disagreement, assessing evidence quality, and integrating advanced retrieval strategies with LLMs is highlighted, along with the real-world limitations of such systems.
The proliferation of accessible online information, particularly in science and health, has made efficient fact-checking systems for scientific claims a necessity. Unlike typical setups that assume evidence documents are pre-identified, or that operate over a small, closed document collection, the study explores automated claim verification in a realistic open-domain setting, testing performance against substantially larger knowledge bases.
The study holds the pipeline for evidence-sentence selection and verdict prediction fixed while varying the knowledge source and the document retrieval method. PubMed, Wikipedia, and Google serve as the knowledge sources, and their effectiveness for evidence retrieval is compared across two methods: BM25 and semantic search using BioSimCSE embeddings. Retrieval quality is evaluated by its effect on downstream verdict-prediction scores, measured on four datasets of biomedical and health claims whose veracity labels were assigned by domain experts.
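To make the two retrieval strategies concrete, the sketch below scores a toy corpus with both. It is a minimal illustration under stated assumptions, not the paper's pipeline: the corpus, the claim, and the encoder checkpoint (a public BioSimCSE model on the Hugging Face Hub) are illustrative choices, and any biomedical sentence encoder could stand in.

```python
# Minimal sketch of the two retrieval strategies compared in the paper.
# Corpus, claim, and checkpoint name are illustrative assumptions.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "Vitamin D supplementation reduces the risk of respiratory infection.",
    "BM25 is a bag-of-words ranking function used in information retrieval.",
    "Statins lower LDL cholesterol and reduce cardiovascular events.",
]
claim = "Vitamin D prevents respiratory infections."

# Sparse retrieval: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_scores = bm25.get_scores(claim.lower().split())

# Dense retrieval: cosine similarity between claim and document embeddings.
encoder = SentenceTransformer("kamalkraj/BioSimCSE-BioLinkBERT-BASE")
doc_emb = encoder.encode(corpus, normalize_embeddings=True)
claim_emb = encoder.encode(claim, normalize_embeddings=True)
dense_scores = doc_emb @ claim_emb

for name, scores in [("BM25", sparse_scores), ("dense", dense_scores)]:
    best = int(np.argmax(scores))
    print(f"{name}: top document -> {corpus[best]!r}")
```

Normalizing the embeddings makes the dot product equal to cosine similarity, the usual scoring choice for SimCSE-style encoders.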
In the experiments, Wikipedia and semantic search typically offer superior recall, identifying relevant evidence across a wide range of documents. BM25, in contrast, achieves higher precision, finding exact matches without being over-inclusive, though at the cost of coverage. The picture is nuanced across claim types and sources: Wikipedia works better for popular health claims, while PubMed supports specialized medical claims more effectively. Querying "the whole web" through Google appears to perform impressively, but inflated scores on datasets whose claims were drawn directly from PubMed point to data leakage, and snippet-based evidence carries its own limitations.
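The precision/recall trade-off can be illustrated with a toy calculation; the document IDs and gold evidence set below are invented for this purpose.

```python
# Toy illustration of the precision/recall trade-off: document IDs and the
# gold evidence set are made up for this example.
relevant = {"d1", "d4"}                      # gold evidence documents for one claim
retrieved = ["d1", "d2", "d3", "d4", "d5"]   # ranked list from some retriever

k = 3
hits = sum(1 for doc in retrieved[:k] if doc in relevant)
precision_at_k = hits / k                    # fraction of retrieved docs that are relevant
recall_at_k = hits / len(relevant)           # fraction of relevant docs that were retrieved
print(f"P@{k} = {precision_at_k:.2f}, R@{k} = {recall_at_k:.2f}")  # P@3 = 0.33, R@3 = 0.50
```

A retriever that returns many documents (larger k) tends to raise recall while diluting precision, which mirrors the Wikipedia/semantic-search versus BM25 contrast reported in the study.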
The research demonstrates both the promise and the challenges of open-domain scientific claim verification. Both PubMed and Wikipedia can serve as competent knowledge sources, with differences emerging from the nature of the claim, and dense retrieval is generally more effective than sparse retrieval, though each excels in particular scenarios. The authors point to future work on modeling disagreement, assessing evidence quality, and integrating retrieval-augmented generation with LLMs to better support fact-checking. While the study paves the way for robust AI-driven fact-checking, the authors also note the limits of real-world applicability and the ethical considerations involved in handling sensitive health and medical misinformation.
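As a rough sketch of the retrieval-augmented direction the authors point to (not their system), retrieved evidence can be passed to an LLM for verdict classification. The prompt template, three-way label set, and model name below are assumptions modeled on common fact-checking setups.

```python
# Hedged sketch of retrieval-augmented verdict prediction: the prompt wording,
# labels, and model choice are assumptions, not the paper's implementation.
from openai import OpenAI

def predict_verdict(claim: str, evidence: list[str], model: str = "gpt-4o-mini") -> str:
    """Classify a claim against retrieved evidence sentences using an LLM."""
    prompt = (
        "Using only the evidence below, classify the claim as "
        "SUPPORTS, REFUTES, or NOT ENOUGH INFO. Answer with the label only.\n\n"
        f"Claim: {claim}\n"
        + "".join(f"Evidence: {sent}\n" for sent in evidence)
    )
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Example call with a made-up evidence sentence:
# predict_verdict(
#     "Vitamin D prevents respiratory infections.",
#     ["Vitamin D supplementation reduced infection risk in a meta-analysis."],
# )
```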