FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation (2310.03214v2)

Published 5 Oct 2023 in cs.CL

Abstract: Most LLMs are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshLLMs/freshqa and commit to updating it at regular intervals.

Refreshing LLMs with Search Engine Augmentation

The paper "FreshLLMs: Refreshing LLMs with Search Engine Augmentation" by Vu et al. presents a paper addressing the limitation of static knowledge in LLMs. The authors highlight that most LLMs lack the ability to update their knowledge dynamically, given they are trained on static corpora. This paper introduces FreshQA, a novel dynamic QA benchmark designed to assess the factual accuracy of LLMs in providing up-to-date information.

Research Context

LLMs such as Bard and ChatGPT are powerful tools for open-domain conversation. However, because their training datasets are fixed, these models often generate information that is outdated or incorrect. The research addresses this flaw by integrating real-time information retrieved from a search engine into the model's input.

FreshQA: A Novel Benchmark

FreshQA is a dynamic benchmark created to evaluate LLMs' ability to incorporate current knowledge. The benchmark consists of 600 questions divided into categories based on the stability of the answer over time: never-changing, slow-changing, fast-changing, and false-premise questions. This categorization ensures a comprehensive evaluation of the models' abilities to handle varying types of information.
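To make the categorization concrete, the sketch below shows one way a benchmark record could be represented. The field names, schema, and example content are assumptions for illustration, not the dataset's actual format.

```python
from dataclasses import dataclass
from enum import Enum

class AnswerType(Enum):
    """The four FreshQA categories, grouped by how the answer changes over time."""
    NEVER_CHANGING = "never-changing"
    SLOW_CHANGING = "slow-changing"
    FAST_CHANGING = "fast-changing"
    FALSE_PREMISE = "false-premise"

@dataclass
class FreshQAItem:
    """Hypothetical record layout for one benchmark question (field names are assumptions)."""
    question: str
    answer: str            # ground-truth answer as of the benchmark's snapshot date
    answer_type: AnswerType
    as_of: str             # date on which the answer was last verified, e.g. "2023-10-05"

# Illustrative fast-changing item (content is a placeholder, not a real dataset entry)
example = FreshQAItem(
    question="What is the latest stable version of the Linux kernel?",
    answer="(answer as of the snapshot date)",
    answer_type=AnswerType.FAST_CHANGING,
    as_of="2023-10-05",
)
```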

Methodology and Findings

The evaluation relied on a thorough human assessment, comprising more than 50,000 judgments, under a two-mode procedure: a relaxed mode that credits only the correctness of the main answer, and a strict mode that additionally penalizes hallucinated or outdated claims. It reveals that all models, regardless of size, struggle with fast-changing and false-premise questions, leaving a significant gap between current capabilities and the desired performance.
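The following is a minimal sketch of how the two evaluation modes could be scored from human-rater flags. The flag names and exact criteria are assumptions that approximate the correctness-versus-hallucination distinction described in the paper, not the authors' rating protocol.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """Hypothetical human-rater flags for one model response (names are assumptions)."""
    primary_answer_correct: bool   # the main answer matches current ground truth
    contains_hallucination: bool   # any unsupported or outdated claim in the response
    has_false_premise: bool        # the question itself rests on a false premise
    debunks_false_premise: bool    # the response rejects/corrects that premise

def relaxed_credit(j: Judgment) -> bool:
    # Relaxed mode: credit the response if its main answer is correct,
    # or, for false-premise questions, if the premise is debunked.
    if j.has_false_premise:
        return j.debunks_false_premise
    return j.primary_answer_correct

def strict_credit(j: Judgment) -> bool:
    # Strict mode: additionally require that the response contains no
    # hallucinated or outdated supporting claims.
    return relaxed_credit(j) and not j.contains_hallucination
```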

To address these limitations, the authors propose FreshPrompt, a simple few-shot prompting method that inserts relevant, up-to-date search-engine results into the prompt context. This method outperforms existing search-augmented approaches such as Self-Ask as well as commercial systems such as Perplexity.AI. FreshPrompt substantially enhances the factual accuracy of LLMs, with empirical evidence of improved handling of fast-changing and false-premise questions.
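A minimal sketch of the prompt-assembly idea appears below. The evidence field names, template wording, and ordering heuristic are assumptions for illustration rather than the authors' exact template; the only grounded elements are that retrieved evidence is placed in the prompt, that evidence order matters, and that the model is instructed to answer concisely.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Evidence:
    """One retrieved search result (field names are assumptions)."""
    source: str    # e.g. the domain of the page
    date: str      # publication or crawl date, ISO format assumed
    title: str
    snippet: str

def build_freshprompt(question: str, evidences: List[Evidence], demos: List[str]) -> str:
    """Assemble a FreshPrompt-style prompt: few-shot demonstrations, then the
    retrieved evidence, then the question with an instruction to answer concisely."""
    # Order evidence so that newer results sit closest to the question,
    # reflecting the paper's finding that evidence order influences correctness.
    ordered = sorted(evidences, key=lambda e: e.date)
    evidence_block = "\n".join(
        f"source: {e.source}\ndate: {e.date}\ntitle: {e.title}\nsnippet: {e.snippet}\n"
        for e in ordered
    )
    demo_block = "\n\n".join(demos)  # few-shot examples mapping evidence to a concise answer
    return (
        f"{demo_block}\n\n"
        f"{evidence_block}\n"
        f"question: {question}\n"
        "answer: Answer as concisely as possible, using only the most relevant "
        "and up-to-date evidence above.\n"
    )
```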

Numerical Results

The results demonstrate substantial improvements in model performance. For instance, the FreshPrompt approach boosts accuracy on FreshQA by 32.6% under relaxed evaluation and by 49.0% under strict evaluation compared to baseline GPT-4. These numbers highlight the potential of real-time data augmentation to enhance the trustworthiness of LLM-generated responses.

Implications and Future Directions

The integration of search engine data provides a pathway for LLMs to maintain relevance and accuracy. This approach has significant practical implications across domains where timely and accurate information is crucial. Theoretically, it hints at the necessity for LLM architectures to be adaptable and receptive to new information streams beyond the initial training datasets.

Future research can explore more sophisticated methods for context integration, automated updating of QA datasets, and evaluation in multilingual and long-form QA settings. Additionally, the potential for combining in-context learning with real-time training updates to maintain information relevance should be investigated.

This paper effectively emphasizes the need for dynamic adaptability in AI models, marking a significant step towards creating more reliable and knowledgeable systems. The release of FreshQA and the commitment to regular updates encourage ongoing exploration in enhancing the temporal reasoning abilities of LLMs.

Authors (11)
  1. Tu Vu (24 papers)
  2. Mohit Iyyer (87 papers)
  3. Xuezhi Wang (64 papers)
  4. Noah Constant (32 papers)
  5. Jerry Wei (16 papers)
  6. Jason Wei (49 papers)
  7. Chris Tar (8 papers)
  8. Yun-Hsuan Sung (18 papers)
  9. Denny Zhou (65 papers)
  10. Quoc Le (39 papers)
  11. Thang Luong (9 papers)
Citations (131)