A Critical Evaluation of the Knowledge Capacity of LLMs: An Analytical Perspective
This paper, authored by Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, and Xin Luna Dong, presents an analytical investigation into the knowledge capacity of LLMs and explores the contentious question of whether LLMs could supplant Knowledge Graphs (KGs) in storing and applying factual knowledge. It introduces a novel benchmark, Head-to-Tail, for evaluating the knowledge competence of LLMs and provides a structured methodology for quantifying how much factual information these models actually internalize.
Core Contributions and Methodology
The benchmark is comprehensive, encompassing 18,000 question-answer pairs drawn from diverse domains. The dataset categorizes entities into head, torso, and tail buckets based on their popularity, allowing for an evaluation of LLM knowledge across different levels of entity visibility in training data. The paper systematically evaluates 14 publicly available LLMs on this benchmark, presenting a quantified assessment of their factual grasp.
The evaluation employs purpose-built metrics, primarily accuracy, hallucination rate, and missing rate. These metrics distinguish between correct answers, incorrect answers produced with unwarranted confidence (hallucinations), and explicit admissions of uncertainty, providing a more nuanced view of LLM capabilities than accuracy alone.
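Because the three metrics partition every answer, they sum to one: each response is either correct, a hallucination, or an admitted miss. A minimal sketch of the computation, assuming responses have already been labeled (the label strings here are illustrative, not the paper's):

```python
from collections import Counter
from typing import Iterable, Tuple

def knowledge_metrics(labels: Iterable[str]) -> Tuple[float, float, float]:
    """Compute accuracy (A), hallucination rate (H), and missing rate (M).

    By construction A + H + M = 1: every answer is either correct,
    a confident-but-wrong hallucination, or an admission of uncertainty
    ("missing"). Label names are illustrative assumptions.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    return (counts["correct"] / n,
            counts["hallucinated"] / n,
            counts["missing"] / n)

# Example: 5 correct, 3 hallucinated, 2 "I don't know" out of 10 questions
print(knowledge_metrics(["correct"] * 5 + ["hallucinated"] * 3 + ["missing"] * 2))
# -> (0.5, 0.3, 0.2)
```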
Key Findings
The paper uncovers several pivotal insights:
- LLM Knowledge Deficiency: Contrary to expectations of comprehensive knowledge retention, the investigated LLMs often falter on nuanced, domain-specific information. Accuracy was low across the board and dropped further for torso and tail facts, undermining the case that LLMs can supplant KGs as stores of factual knowledge.
- Model Performance Variations: There is a conspicuous decline in LLM performance from head entities toward tail entities, confirming the hypothesis that LLMs integrate commonly available (head) knowledge far better than rare (tail) knowledge. This reflects the power-law distribution of entity mentions in real-world training data: head entities appear often, tail entities rarely.
- Impact of Instruction Tuning and Model Size: Larger model sizes and common enhancement techniques such as instruction tuning did not significantly improve the factual reliability of the LLMs. This calls for strategies beyond traditional scaling and tuning to address knowledge gaps effectively.
Implications for Future Research
The paper offers strategic directions for future research in Knowledge Representation and AI. It proposes the concept of Dual Neural Knowledge Graphs, envisioning a hybrid approach that marries explicit, symbolic KGs with the implicit knowledge embedded in LLMs. This dual structure aims to balance human-readable symbolic knowledge with machine-optimizable neural representations, potentially reshaping knowledge retrieval systems.
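One way to picture such a dual structure is a lookup that consults the explicit triple store first and falls back to the neural model only for facts the KG lacks, while tracking the provenance of each answer. The sketch below is purely illustrative of the proposed division of labor; the triple-store layout and the query_llm callable are assumptions, not the paper's design.

```python
from typing import Callable, Dict, Optional, Tuple

def dual_lookup(
    subject: str,
    relation: str,
    triples: Dict[Tuple[str, str], str],        # explicit, human-readable KG
    query_llm: Callable[[str], Optional[str]],  # implicit, neural knowledge (hypothetical)
) -> Tuple[Optional[str], str]:
    """Return (answer, provenance), preferring symbolic facts over neural guesses.

    Illustrative sketch only: a verified triple wins outright; the LLM is a
    fallback whose answer is flagged "neural" so callers can treat it as
    unverified (e.g., route it through fact-checking before use).
    """
    fact = triples.get((subject, relation))
    if fact is not None:
        return fact, "symbolic"
    return query_llm(f"What is the {relation} of {subject}?"), "neural"
```

The design choice this illustrates is exactly the trade-off the authors highlight: symbolic entries stay auditable and precise for head and torso facts, while the neural component covers the long tail at the cost of reliability.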
Conclusion
In summary, this paper provides a rigorous evaluation of LLMs’ knowledge retention, challenging the narrative that LLMs might soon replace KGs. It highlights substantial limitations in the factual knowledge embedded in these models, especially for long-tail information. The benchmark and its methodological framework serve as an essential resource for future studies, and the work sharpens the discourse on LLM knowledge integration, pushing researchers toward architectures and learning paradigms that bridge symbolic and neural knowledge effectively.