- The paper presents a novel multi-task framework that integrates tool use to extract, query, and verify factual claims in LLM-generated text.
- Experimental results show FacTool achieving up to 95.24% F1 on literature review tasks and outperforming baseline self-checking models.
- FacTool’s approach enhances LLM reliability for high-stakes applications in fields like healthcare, law, and scientific research.
Factuality Detection in Generative AI: An Analysis of FacTool
The paper "FacTool: Factuality Detection in Generative AI," introduces a framework designed to effectively detect factual errors in text generated by LLMs. The emergence of models like GPT-4 has revolutionized NLP, allowing various tasks to be unified under sequence generation. However, factual inaccuracies remain a significant challenge, restricting the application of these models in high-stakes fields such as healthcare and law. The authors propose a task and domain-agnostic system named FacTool to address this challenge systematically.
Key Contributions and Methodology
FacTool integrates the concept of "tool use" with factuality detection, leveraging external tools such as Google Search, Google Scholar, and Python interpreters to assess the factuality of generated content across diverse tasks: knowledge-based QA, code generation, mathematical problem-solving, and scientific literature review writing. The framework consists of four components (a minimal end-to-end sketch follows the list):
- Claim Extraction: Using LLMs such as ChatGPT and GPT-4, FacTool extracts claims from generated text, focusing on atomic content units and fine-grained factual detail.
- Query Generation: The framework generates queries from extracted claims to gather external evidence using different tools.
- Tool Querying and Evidence Collection: Leveraging APIs such as Google Search and Google Scholar, FacTool collects evidence to verify claims.
- Agreement Verification: This step involves analyzing evidence to assign binary factuality labels to the claims.
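To make the pipeline concrete, here is a minimal sketch of the four stages in Python. The helper names (`call_llm`, `web_search`), the prompts, and the `Claim` dataclass are placeholders of my own, not FacTool's actual API; the real framework drives ChatGPT/GPT-4 and retrieval backends such as the Google Search and Google Scholar APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str                                    # one atomic, checkworthy statement
    queries: list = field(default_factory=list)  # search queries derived from the claim
    evidence: list = field(default_factory=list) # snippets retrieved from tools
    factual: bool = False                        # binary label from verification

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-model call; FacTool drives ChatGPT/GPT-4 here."""
    raise NotImplementedError

def web_search(query: str, top_k: int = 3) -> list:
    """Placeholder for an evidence retriever (e.g. a Google Search API wrapper)."""
    raise NotImplementedError

def extract_claims(response: str) -> list:
    # Stage 1: prompt the LLM to decompose the response into atomic claims.
    out = call_llm("List every atomic factual claim in the text, one per line:\n" + response)
    return [Claim(line.strip()) for line in out.splitlines() if line.strip()]

def generate_queries(claim: Claim) -> Claim:
    # Stage 2: turn each claim into search queries for the external tool.
    out = call_llm("Write two search queries that would verify this claim:\n" + claim.text)
    claim.queries = [q.strip() for q in out.splitlines() if q.strip()]
    return claim

def collect_evidence(claim: Claim) -> Claim:
    # Stage 3: query the tool and pool the retrieved snippets.
    for query in claim.queries:
        claim.evidence.extend(web_search(query))
    return claim

def verify(claim: Claim) -> Claim:
    # Stage 4: ask the LLM whether the pooled evidence supports the claim.
    out = call_llm(
        "Does the evidence support the claim? Answer True or False.\n"
        f"Claim: {claim.text}\nEvidence: {claim.evidence}"
    )
    claim.factual = out.strip().lower().startswith("true")
    return claim

def factool_pipeline(response: str) -> list:
    return [verify(collect_evidence(generate_queries(c))) for c in extract_claims(response)]
```

Response-level judgments can then be derived by aggregating the claim labels, e.g. marking a whole response non-factual if any of its claims fails verification.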
The research adapts traditional factuality detection, which assumes that claims and supporting evidence are given explicitly, to generative settings where neither is available. The emphasis on fine-grained, claim-level detection yields a more comprehensive assessment of LLM outputs and a clearer path to improving their reliability.
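In the math setting, for example, verification needs no web evidence at all: FacTool hands extracted calculations to a Python interpreter and checks the claimed result. A minimal sketch, assuming claims arrive as (expression, claimed answer) pairs, with `eval` standing in for a properly sandboxed executor:

```python
def verify_math_claim(calculation: str, claimed_answer: str) -> bool:
    """Execute an extracted calculation and compare it with the claimed result.
    NOTE: eval() on untrusted LLM output is unsafe; a real system would use a
    sandboxed interpreter. This is a sketch, not FacTool's actual code."""
    try:
        return abs(eval(calculation) - float(claimed_answer)) < 1e-6
    except Exception:
        return False  # unparseable or failing claims count as unverified

# e.g. a solution asserting "18 * 12 = 216" yields the pair below
print(verify_math_claim("18 * 12", "216"))  # True
print(verify_math_claim("18 * 12", "226"))  # False
```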
Experimental Results
The paper presents robust experimental results across several datasets, including RoSE, HumanEval, GSM-Hard, and custom scientific literature review prompts. FacTool outperformed self-checking LLM baselines in every scenario, most notably in the scientific literature review setting. When powered by GPT-4, for instance, it achieved a factuality F1 score of 89.09% on knowledge-based QA and 95.24% on scientific literature review tasks.
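For context on the metric itself, factuality F1 is the harmonic mean of precision and recall over the verifier's binary labels. A quick illustration, treating "factual" as the positive class (the paper's exact convention may differ; the gold/predicted labels below are made up, not the paper's data):

```python
def f1_score(gold, pred):
    """Harmonic mean of precision and recall over parallel lists of booleans,
    with True ("factual") as the positive class."""
    tp = sum(g and p for g, p in zip(gold, pred))  # true positives
    precision = tp / sum(pred) if sum(pred) else 0.0
    recall = tp / sum(gold) if sum(gold) else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# toy example: 4 claims, the detector wrongly accepts the one false claim
gold = [True, True, False, True]
pred = [True, True, True, True]
print(round(f1_score(gold, pred), 4))  # 0.8571
```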
The framework's versatile, multi-tool strategy gives it a significant advantage over existing single-task systems, and the results underscore the value of integrating external verification into LLM-generated content.
Implications and Future Directions
FacTool's approach has substantial implications for deploying LLMs in sensitive domains. By providing a reliable method for factuality detection, the framework supports the use of LLMs in fields where accuracy is critical, and it lays the groundwork for hybrid models that pair generative capability with robust factual verification.
Future research could extend FacTool's architecture to more complex multi-modal tasks, integrating image or video data alongside text. Additionally, exploring the framework's adaptability to other languages and cultural contexts could further enhance its application scope.
In conclusion, FacTool presents a compelling solution to one of the pressing challenges in modern AI, offering a sophisticated, multi-dimensional approach to factuality detection. The paper advances the discourse in both theoretical AI alignment and practical application, paving the way for safer AI deployments in real-world scenarios.