Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
The paper "Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall" addresses the crucial need for comprehensive factuality assessments of LLMs. As the adoption of LLMs accelerates in diverse applications, the persistent issue of hallucinations necessitates robust benchmarks to evaluate factual knowledge recall. This paper introduces a novel benchmark, FACT-Bench, that encompasses a broad spectrum of domains, properties, and answer types to provide a detailed understanding of LLMs' capabilities and limitations in recalling factual knowledge from pretraining data.
FACT-Bench: Design and Scope
FACT-Bench is constructed to address several key evaluation criteria:
- Simplicity: It focuses on simple question-answer (QA) pairs derived from Wikidata triplets, ensuring that questions are straightforward and require only knowledge recall (a construction sketch follows this list).
- Validity: It ensures that all questions are answerable by checking that the answers are grounded within Wikipedia, a common source for LLM pretraining datasets.
- Diversity: It includes 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers), ensuring broad coverage.
- Specificity: Questions are designed to elicit specific, unique answers, minimizing ambiguity and potential multiple valid responses.
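To make the construction concrete, here is a minimal Python sketch of turning a Wikidata triplet into a QA pair and checking that its answer is grounded in Wikipedia. The question templates, example entities, and the surface-level substring check are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: build a QA pair from a (subject, property, object)
# triplet and keep it only if the answer appears on the subject's Wikipedia page.

PROPERTY_TEMPLATES = {
    # property -> question template (illustrative examples, not the paper's)
    "place of birth": "Where was {subject} born?",
    "date of birth": "When was {subject} born?",
    "author": "Who is the author of {subject}?",
}

def triplet_to_qa(subject: str, prop: str, obj: str):
    """Turn a Wikidata-style triplet into a simple QA pair, if a template exists."""
    template = PROPERTY_TEMPLATES.get(prop)
    if template is None:
        return None
    return {"question": template.format(subject=subject), "answer": obj}

def is_grounded(answer: str, subject_page_text: str) -> bool:
    """Validity check: keep the pair only if the answer string appears
    on the subject's Wikipedia page (a simple surface-level proxy)."""
    return answer.lower() in subject_page_text.lower()

# Example usage with made-up data:
qa = triplet_to_qa("Marie Curie", "place of birth", "Warsaw")
page = "Marie Curie was born in Warsaw, in what was then ..."  # stand-in page text
if qa and is_grounded(qa["answer"], page):
    print(qa)  # {'question': 'Where was Marie Curie born?', 'answer': 'Warsaw'}
```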
Benchmark Results
The paper benchmarks 31 models across 10 different families, ranging from proprietary models like GPT-4 and GPT-3.5-turbo to open-source models like LLaMA, Falcon, and MPT. Empirical results reveal that instruction-tuning, while beneficial for aligning LLM outputs to user-friendly formats, often detracts from factual knowledge recall. For instance, pretraining-only models like LLaMA consistently outperform their instruction-tuned counterparts such as Vicuna on factual recall tasks.
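Such comparisons presuppose a way to score free-form model answers against gold answers. Below is a minimal sketch of one plausible scoring scheme (normalized substring matching); the paper's actual evaluation metric may differ.

```python
# Hypothetical answer-scoring sketch: normalize both strings and count a
# prediction as correct if the gold answer appears in the model output.

import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def is_correct(model_answer: str, gold_answer: str) -> bool:
    """Relaxed exact match: the normalized gold string must appear
    in the normalized model output."""
    return normalize(gold_answer) in normalize(model_answer)

def accuracy(predictions, golds) -> float:
    correct = sum(is_correct(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

# Example: comparing two (hypothetical) models on the same questions
golds = ["Warsaw", "1867"]
print(accuracy(["She was born in Warsaw.", "In 1867."], golds))  # 1.0
print(accuracy(["Paris, I believe.", "1867"], golds))            # 0.5
```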
Notable findings include:
- Model Scaling: Larger models consistently outperform their smaller counterparts within the same family, revealing positive scaling effects on factual knowledge recall.
- Instruction-Tuning: Instruction-tuned models generally underperform compared to their pretraining-only versions, both in zero-shot and few-shot settings, likely due to the alignment tax imposed by instruction-tuning.
- In-Context Learning (ICL): Few-shot ICL provides substantial performance improvements, but the benefits diminish beyond a certain number of shots; for larger models in particular, providing more than five exemplars yields negligible gains. A sketch of few-shot prompt construction follows this list.
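For reference, assembling a k-shot prompt from QA exemplars looks roughly like the following; the prompt template and exemplars are illustrative, not the paper's exact format.

```python
# Hypothetical sketch: prepend k demonstration QA pairs to the test question.

def build_few_shot_prompt(exemplars, test_question, k=5):
    """Build a k-shot prompt from a list of {'question', 'answer'} exemplars."""
    lines = []
    for ex in exemplars[:k]:
        lines.append(f"Question: {ex['question']}")
        lines.append(f"Answer: {ex['answer']}")
    lines.append(f"Question: {test_question}")
    lines.append("Answer:")
    return "\n".join(lines)

exemplars = [
    {"question": "Where was Marie Curie born?", "answer": "Warsaw"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]
print(build_few_shot_prompt(exemplars, "When was the Eiffel Tower completed?", k=2))
```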
Counterfactual ICL Studies
An intriguing aspect of the paper is the exploration of counterfactual ICL, where models are provided with exemplars containing incorrect answers. Large models such as LLaMA-65B and Falcon-180B show significant performance degradation when given counterfactual exemplars that contradict knowledge they otherwise recall correctly. This effect grows with the number of such exemplars, underscoring how sensitive large models are to the factual accuracy of their in-context data.
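One simple way to build such counterfactual exemplars is to reassign answers among the demonstrations so that every question is paired with a plausible but incorrect answer. The sketch below illustrates this idea; the paper's exact construction procedure may differ.

```python
# Hypothetical sketch: derange the answers across exemplars so that no
# demonstration keeps its true answer.

import random

def make_counterfactual(exemplars, seed=0):
    """Return a copy of the exemplars with answers shuffled so that
    no exemplar is paired with its original (correct) answer."""
    rng = random.Random(seed)
    answers = [ex["answer"] for ex in exemplars]
    shuffled = answers[:]
    while any(a == b for a, b in zip(answers, shuffled)):
        rng.shuffle(shuffled)
    return [
        {"question": ex["question"], "answer": wrong}
        for ex, wrong in zip(exemplars, shuffled)
    ]

exemplars = [
    {"question": "Where was Marie Curie born?", "answer": "Warsaw"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
]
for ex in make_counterfactual(exemplars):
    print(ex)  # each question now carries another exemplar's answer
```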
Fine-Tuning Experiments
The paper further explores the implications of fine-tuning with different types of knowledge:
- Known vs. Unknown Knowledge: Fine-tuning on knowledge already known to the model yields significantly better performance than fine-tuning on mixed or entirely unknown knowledge (a sketch of a known/unknown split follows this list).
- Factual Accuracy in Training Data: Consistent with ICL results, the factual accuracy of fine-tuning data is crucial. Models fine-tuned with counterfactual data exhibit substantial performance drops, reinforcing that erroneous training data can teach models to "hallucinate" incorrect facts.
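Below is a minimal sketch of how QA pairs could be split into known and unknown knowledge relative to a given model before fine-tuning. The zero-shot probe, the dummy model, and the matching rule are hypothetical stand-ins for whatever probing procedure the paper actually uses.

```python
# Hypothetical sketch: label each QA pair by whether the base model already
# recalls its answer in a zero-shot probe, then draw fine-tuning sets from
# the "known", "unknown", or mixed pools.

def split_known_unknown(qa_pairs, ask_model, is_correct):
    """Partition QA pairs by whether the base model answers them correctly."""
    known, unknown = [], []
    for qa in qa_pairs:
        prediction = ask_model(qa["question"])
        (known if is_correct(prediction, qa["answer"]) else unknown).append(qa)
    return known, unknown

# Example with a dummy "model" that only knows one fact:
dummy_model = lambda q: "Warsaw" if "Curie" in q else "I don't know"
qa_pairs = [
    {"question": "Where was Marie Curie born?", "answer": "Warsaw"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]
known, unknown = split_known_unknown(
    qa_pairs, dummy_model, lambda pred, gold: gold.lower() in pred.lower()
)
print(len(known), len(unknown))  # 1 1
```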
Practical and Theoretical Implications
The paper's findings have profound implications for both the deployment and future development of LLMs:
- Practical Applications: Users of LLMs in critical applications such as medical advice or legal consultations should ensure the factual accuracy of input data and prefer fewer, high-quality in-context exemplars.
- Future Developments: Researchers should focus on improving pretraining strategies and explore advanced fine-tuning techniques that mitigate the negative impacts of instruction-tuning on factual recall. Further studies might investigate dynamic scaling mechanisms tailored to specific factuality tasks.
Conclusion
The FACT-Bench benchmark represents a significant step towards a holistic evaluation of LLMs' factual knowledge recall, providing critical insights into the effects of model scaling, instruction-tuning, and the quality of training data. These findings can help guide the development of more reliable and accurate LLMs, ensuring their practical utility across various domains.