
Systematic Assessment of Factual Knowledge in Large Language Models (2310.11638v3)

Published 18 Oct 2023 in cs.CL

Abstract: Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in LLMs. However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains, which may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate state-of-the-art LLMs with KGs in generic and specific domains. The experiments show that ChatGPT is consistently the top performer across all domains. We also find that LLM performance depends on instruction finetuning, domain, and question complexity, and is susceptible to adversarial context.
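
The generate-then-score loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the triple format, the relation names, the question templates, and the exact-match scoring rule are all assumptions chosen for illustration, and the model is represented by any hypothetical callable that maps a question string to an answer string.

```python
from typing import Callable, List, Tuple

# A KG fact as a (subject, relation, object) triple. Assumed format.
Triple = Tuple[str, str, str]

# Hypothetical question templates, one per relation. A real framework
# would need a template (or a generator) for every relation in the KG.
TEMPLATES = {
    "capital": "What is the capital of {subject}?",
    "place_of_birth": "Where was {subject} born?",
}


def generate_questions(triples: List[Triple]) -> List[Tuple[str, str]]:
    """Turn KG triples into (question, expected_answer) pairs.

    Triples whose relation has no template are skipped.
    """
    qa_pairs = []
    for subject, relation, obj in triples:
        template = TEMPLATES.get(relation)
        if template is None:
            continue
        qa_pairs.append((template.format(subject=subject), obj))
    return qa_pairs


def accuracy(model: Callable[[str], str], qa_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of questions answered correctly under a
    case-insensitive exact match (an assumed scoring rule)."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for question, expected in qa_pairs
        if model(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(qa_pairs)


if __name__ == "__main__":
    facts = [
        ("France", "capital", "Paris"),
        ("Marie Curie", "place_of_birth", "Warsaw"),
    ]

    # Stand-in for an LLM call: it knows only one of the two facts.
    def toy_model(question: str) -> str:
        return "Paris" if "France" in question else "unknown"

    print(accuracy(toy_model, generate_questions(facts)))
```

Exact match is the simplest possible scoring rule and would undercount correct answers phrased differently from the KG label; it is used here only to keep the sketch short.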

Authors (4)
  1. Linhao Luo (31 papers)
  2. Thuy-Trang Vu (23 papers)
  3. Dinh Phung (147 papers)
  4. Gholamreza Haffari (141 papers)
Citations (5)