
PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations (2405.19740v2)

Published 30 May 2024 in cs.CL, cs.AI, and cs.CY

Abstract: Expert-designed closed-ended benchmarks are indispensable for assessing the knowledge capacity of LLMs. Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of response consistency analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six representative LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance on raw benchmarks, including an absolute 25.8% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we find that PertEval preserves LLMs' uncertainty toward specious knowledge and exposes their rote memorization of correct options, which leads to overestimated performance. We also find that the detailed response consistency analyses provided by PertEval can illuminate various weaknesses in existing LLMs' knowledge mastery and guide their refinement. Our findings provide insights for developing more robust and genuinely knowledgeable LLMs. Our code is available at https://github.com/aigc-apps/PertEval.
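To make the core idea concrete, here is a minimal, hypothetical sketch of a knowledge-invariant perturbation and a raw-vs-perturbed comparison on a multiple-choice item. It is not the PertEval implementation; it only uses option-order shuffling as one perturbation that preserves knowledge-critical content, and all names (Item, shuffle_options, consistency) are illustrative assumptions.

```python
# Illustrative sketch only, not the PertEval toolkit. It mimics the idea of a
# knowledge-invariant perturbation (shuffling option order keeps the tested
# knowledge intact) and the raw-vs-perturbed comparison used to probe whether
# accuracy reflects genuine knowledge or surface memorization.
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    question: str
    options: List[str]  # option texts
    answer: int         # index of the correct option


def shuffle_options(item: Item, seed: int) -> Item:
    """Knowledge-invariant perturbation: permute the options, track the gold index."""
    rng = random.Random(seed)
    order = list(range(len(item.options)))
    rng.shuffle(order)
    return Item(
        question=item.question,
        options=[item.options[i] for i in order],
        answer=order.index(item.answer),  # new position of the original answer
    )


def accuracy(preds: List[int], items: List[Item]) -> float:
    """Fraction of items answered correctly."""
    return sum(p == it.answer for p, it in zip(preds, items)) / len(items)


def consistency(raw_correct: List[bool], pert_correct: List[bool]) -> float:
    """Fraction of items judged identically (both right or both wrong) on raw vs. perturbed sets."""
    return sum(r == p for r, p in zip(raw_correct, pert_correct)) / len(raw_correct)
```

Under this sketch, the gap between accuracy on the raw items and accuracy on their perturbed counterparts plays the role of the overestimation the paper reports (e.g., the absolute 25.8% figure for GPT-4), while low consistency flags items where a model's correct answers may rest on memorized surface patterns rather than the underlying knowledge.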

Authors (8)
  1. Jiatong Li (47 papers)
  2. Renjun Hu (9 papers)
  3. Kunzhe Huang (7 papers)
  4. Yan Zhuang (62 papers)
  5. Qi Liu (485 papers)
  6. Mengxiao Zhu (7 papers)
  7. Xing Shi (20 papers)
  8. Wei Lin (207 papers)
Citations (1)

