Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
The paper "Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall" addresses the crucial need for comprehensive factuality assessments of LLMs. As the adoption of LLMs accelerates in diverse applications, the persistent issue of hallucinations necessitates robust benchmarks to evaluate factual knowledge recall. This paper introduces a novel benchmark, FACT-Bench, that encompasses a broad spectrum of domains, properties, and answer types to provide a detailed understanding of LLMs' capabilities and limitations in recalling factual knowledge from pretraining data.
FACT-Bench: Design and Scope
FACT-Bench is constructed to address several key evaluation criteria:
- Simplicity: It focuses on simple question-answer (QA) pairs derived from Wikidata triplets, ensuring that questions are straightforward and require only knowledge recall (a construction sketch follows this list).
- Validity: It ensures that all questions are answerable by checking that the answers are grounded within Wikipedia, a common source for LLM pretraining datasets.
- Diversity: It includes 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers), ensuring broad coverage.
- Specificity: Questions are designed to elicit specific, unique answers, minimizing ambiguity and potential multiple valid responses.
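To make the construction concrete, here is a minimal Python sketch of turning a Wikidata triplet into a QA pair and checking that its answer is grounded in Wikipedia. The question templates, example entities, and the surface-level substring check are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: build a QA pair from a (subject, property, object)
# triplet and keep it only if the answer appears on the subject's Wikipedia page.

PROPERTY_TEMPLATES = {
    # property -> question template (illustrative examples, not the paper's)
    "place of birth": "Where was {subject} born?",
    "date of birth": "When was {subject} born?",
    "author": "Who is the author of {subject}?",
}

def triplet_to_qa(subject: str, prop: str, obj: str):
    """Turn a Wikidata-style triplet into a simple QA pair, if a template exists."""
    template = PROPERTY_TEMPLATES.get(prop)
    if template is None:
        return None
    return {"question": template.format(subject=subject), "answer": obj}

def is_grounded(answer: str, subject_page_text: str) -> bool:
    """Validity check: keep the pair only if the answer string appears
    on the subject's Wikipedia page (a simple surface-level proxy)."""
    return answer.lower() in subject_page_text.lower()

# Example usage with made-up data:
qa = triplet_to_qa("Marie Curie", "place of birth", "Warsaw")
page = "Marie Curie was born in Warsaw, in what was then ..."  # stand-in page text
if qa and is_grounded(qa["answer"], page):
    print(qa)  # {'question': 'Where was Marie Curie born?', 'answer': 'Warsaw'}
```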
Benchmark Results
The paper benchmarks 31 models across 10 different families, ranging from proprietary models like GPT-4 and GPT-3.5-turbo to open-source models like LLaMA, Falcon, and MPT. Empirical results reveal that instruction-tuning, while beneficial for aligning LLM outputs to user-friendly formats, often detracts from factual knowledge recall. For instance, pretraining-only models like LLaMA consistently outperform their instruction-tuned counterparts such as Vicuna on factual recall tasks.
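Such comparisons presuppose a way to score free-form model answers against gold answers. Below is a minimal sketch of one plausible scoring scheme (normalized substring matching); the paper's actual evaluation metric may differ.

```python
# Hypothetical answer-scoring sketch: normalize both strings and count a
# prediction as correct if the gold answer appears in the model output.

import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def is_correct(model_answer: str, gold_answer: str) -> bool:
    """Relaxed exact match: the normalized gold string must appear
    in the normalized model output."""
    return normalize(gold_answer) in normalize(model_answer)

def accuracy(predictions, golds) -> float:
    correct = sum(is_correct(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

# Example: comparing two (hypothetical) models on the same questions
golds = ["Warsaw", "1867"]
print(accuracy(["She was born in Warsaw.", "In 1867."], golds))  # 1.0
print(accuracy(["Paris, I believe.", "1867"], golds))            # 0.5
```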
Notable findings include:
- Model Scaling: Larger models consistently outperform their smaller counterparts within the same family, revealing positive scaling effects on factual knowledge recall.
- Instruction-Tuning: Instruction-tuned models generally underperform compared to their pretraining-only versions, both in zero-shot and few-shot settings, likely due to the alignment tax imposed by instruction-tuning.
- In-Context Learning (ICL): Few-shot ICL provides substantial performance improvements, but the benefits diminish beyond a certain number of shots; for larger models in particular, providing more than five exemplars yields negligible gains. A sketch of few-shot prompt construction follows this list.
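For reference, assembling a k-shot prompt from QA exemplars looks roughly like the following; the prompt template and exemplars are illustrative, not the paper's exact format.

```python
# Hypothetical sketch: prepend k demonstration QA pairs to the test question.

def build_few_shot_prompt(exemplars, test_question, k=5):
    """Build a k-shot prompt from a list of {'question', 'answer'} exemplars."""
    lines = []
    for ex in exemplars[:k]:
        lines.append(f"Question: {ex['question']}")
        lines.append(f"Answer: {ex['answer']}")
    lines.append(f"Question: {test_question}")
    lines.append("Answer:")
    return "\n".join(lines)

exemplars = [
    {"question": "Where was Marie Curie born?", "answer": "Warsaw"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]
print(build_few_shot_prompt(exemplars, "When was the Eiffel Tower completed?", k=2))
```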
Counterfactual ICL Studies
An intriguing aspect of the paper is the exploration of counterfactual ICL, where models are provided with exemplars containing incorrect answers. Large models such as LLaMA-65B and Falcon-180B show significant performance degradation when given counterfactual exemplars that contradict knowledge they otherwise recall correctly. This effect grows with the number of such exemplars, underscoring how sensitive large models are to the factual accuracy of their in-context data.
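One simple way to build such counterfactual exemplars is to reassign answers among the demonstrations so that every question is paired with a plausible but incorrect answer. The sketch below illustrates this idea; the paper's exact construction procedure may differ.

```python
# Hypothetical sketch: derange the answers across exemplars so that no
# demonstration keeps its true answer.

import random

def make_counterfactual(exemplars, seed=0):
    """Return a copy of the exemplars with answers shuffled so that
    no exemplar is paired with its original (correct) answer."""
    rng = random.Random(seed)
    answers = [ex["answer"] for ex in exemplars]
    shuffled = answers[:]
    while any(a == b for a, b in zip(answers, shuffled)):
        rng.shuffle(shuffled)
    return [
        {"question": ex["question"], "answer": wrong}
        for ex, wrong in zip(exemplars, shuffled)
    ]

exemplars = [
    {"question": "Where was Marie Curie born?", "answer": "Warsaw"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
]
for ex in make_counterfactual(exemplars):
    print(ex)  # each question now carries another exemplar's answer
```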
Fine-Tuning Experiments
The paper further explores the implications of fine-tuning with different types of knowledge:
- Known vs. Unknown Knowledge: Fine-tuning on knowledge already known to the model yields significantly better performance than fine-tuning on mixed or entirely unknown knowledge (a sketch of a known/unknown split follows this list).
- Factual Accuracy in Training Data: Consistent with ICL results, the factual accuracy of fine-tuning data is crucial. Models fine-tuned with counterfactual data exhibit substantial performance drops, reinforcing that erroneous training data can teach models to "hallucinate" incorrect facts.
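Below is a minimal sketch of how QA pairs could be split into known and unknown knowledge relative to a given model before fine-tuning. The zero-shot probe, the dummy model, and the matching rule are hypothetical stand-ins for whatever probing procedure the paper actually uses.

```python
# Hypothetical sketch: label each QA pair by whether the base model already
# recalls its answer in a zero-shot probe, then draw fine-tuning sets from
# the "known", "unknown", or mixed pools.

def split_known_unknown(qa_pairs, ask_model, is_correct):
    """Partition QA pairs by whether the base model answers them correctly."""
    known, unknown = [], []
    for qa in qa_pairs:
        prediction = ask_model(qa["question"])
        (known if is_correct(prediction, qa["answer"]) else unknown).append(qa)
    return known, unknown

# Example with a dummy "model" that only knows one fact:
dummy_model = lambda q: "Warsaw" if "Curie" in q else "I don't know"
qa_pairs = [
    {"question": "Where was Marie Curie born?", "answer": "Warsaw"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]
known, unknown = split_known_unknown(
    qa_pairs, dummy_model, lambda pred, gold: gold.lower() in pred.lower()
)
print(len(known), len(unknown))  # 1 1
```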
Practical and Theoretical Implications
The paper's findings have profound implications for both the deployment and future development of LLMs:
- Practical Applications: Users of LLMs in critical applications such as medical advice or legal consultations should ensure the factual accuracy of input data and prefer fewer, high-quality in-context exemplars.
- Future Developments: Researchers should focus on improving pretraining strategies and explore advanced fine-tuning techniques that mitigate the negative impacts of instruction-tuning on factual recall. Further studies might investigate dynamic scaling mechanisms tailored to specific factuality tasks.
Conclusion
The FACT-Bench benchmark represents a significant step towards a holistic evaluation of LLMs' factual knowledge recall, providing critical insights into the effects of model scaling, instruction-tuning, and the quality of training data. These findings can help guide the development of more reliable and accurate LLMs, ensuring their practical utility across various domains.