
Towards Reliable Latent Knowledge Estimation in LLMs: Zero-Prompt Many-Shot Based Factual Knowledge Extraction (2404.12957v2)

Published 19 Apr 2024 in cs.CL and cs.LG

Abstract: In this paper, we focus on the challenging task of reliably estimating factual knowledge that is embedded inside LLMs. To avoid reliability concerns with prior approaches, we propose to eliminate prompt engineering when probing LLMs for factual knowledge. Our approach, called Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the in-context learning ability of LLMs to communicate both the factual knowledge question as well as the expected answer format. Our knowledge estimator is both conceptually simpler (i.e., doesn't depend on meta-linguistic judgments of LLMs) and easier to apply (i.e., is not LLM-specific), and we demonstrate that it can surface more of the latent knowledge embedded in LLMs. We also investigate how different design choices affect the performance of ZP-LKE. Using the proposed estimator, we perform a large-scale evaluation of the factual knowledge of a variety of open-source LLMs, like OPT, Pythia, Llama(2), Mistral, Gemma, etc. over a large set of relations and facts from the Wikidata knowledge base. We observe differences in the factual knowledge between different model families and models of different sizes, that some relations are consistently better known than others but that models differ in the precise facts they know, and differences in the knowledge of base models and their finetuned counterparts. Code available at: https://github.com/QinyuanWu0710/ZeroPrompt_LKE

Overview of Latent Knowledge Estimation in LLMs

The paper "Towards Reliable Latent Knowledge Estimation in LLMs: In-Context Learning vs. Prompting Based Factual Knowledge Extraction" investigates the critical process of estimating factual knowledge embedded in LLMs. This issue is of primary concern as LLMs are increasingly used in tasks requiring factual accuracy, such as information retrieval and question answering. The authors seek to address the limitations of prior methods with an innovative approach focusing on in-context learning (ICL), as opposed to the prevailing strategies based on prompt engineering.

At the core of this research is a novel latent knowledge estimator, the Zero-Prompt Latent Knowledge Estimator (ZP-LKE), which leverages the ICL capabilities of LLMs. The estimator is designed to measure factual knowledge within LLMs more reliably, avoiding the pitfalls of prompt-based methods. Previous methodologies typically relied on hand-crafted prompts, which are susceptible to prompt-engineering biases and depend heavily on the model's ability to interpret specific linguistic constructs. The ICL-based method proposed here is conceptually simpler and more broadly applicable, as it elicits knowledge from the model through patterns in the input rather than intricate prompt structures.

Methodology

The paper details a method in which the latent knowledge estimator represents factual knowledge as (subject, relation, object) triplets and probes whether the LLM can recover the object for a given subject and relation. Using ICL, the researchers construct sequences containing several related factual examples and let the model complete the sequence based on the pattern it has picked up, rather than on an explicit natural-language prompt. Presenting a consistent pattern of example facts in this way sidesteps the variability introduced by hand-crafted prompt wording.
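To make the construction concrete, the following is a minimal sketch (not the authors' released code) of how a zero-prompt, many-shot query might be assembled from knowledge-base triplets. The example facts, the relation chosen, and the expected-answer comment are illustrative placeholders.

```python
# Minimal sketch: build an in-context query from (subject, relation, object)
# triplets for one relation, leaving the final object for the model to complete.

from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

def build_icl_query(examples: List[Triplet], query_subject: str) -> str:
    """Concatenate known subject-object pairs, then append the query subject
    so that the model's continuation is interpreted as its answer."""
    lines = [f"{subj} {obj}" for subj, _, obj in examples]
    lines.append(query_subject)  # the model should continue with the object
    return "\n".join(lines)

# Hypothetical facts for a "capital of" relation drawn from a knowledge base.
examples = [
    ("France", "capital", "Paris"),
    ("Japan", "capital", "Tokyo"),
    ("Canada", "capital", "Ottawa"),
    ("Egypt", "capital", "Cairo"),
]

print(build_icl_query(examples, "Germany"))
# France Paris
# Japan Tokyo
# Canada Ottawa
# Egypt Cairo
# Germany        <- the model's next tokens are checked against "Berlin"
```

No instruction or question template is needed; the repeated subject-object pattern communicates both the relation being asked about and the expected answer format.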

The paper also addresses potential challenges in ICL-based estimation by analyzing the effect of different in-context example configurations, including their number, correctness, and order within the sequence. Through comprehensive experimentation, the researchers show that models which already encode a fact require relatively few in-context examples to produce reliable outputs, and that incorrect examples in the input sequence can significantly degrade the accuracy of knowledge extraction.
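This kind of design-choice analysis can be mimicked with a small evaluation loop. The sketch below is illustrative only: `complete` stands in for a call to whichever model is being probed, `corrupt` flips a fraction of demonstration objects to simulate incorrect in-context facts, and `build_icl_query` is the helper from the previous sketch.

```python
import random
from typing import Callable, List

def corrupt(examples: List[Triplet], fraction: float, rng: random.Random) -> List[Triplet]:
    """Replace the object of a given fraction of examples with an object taken
    from another example, simulating incorrect in-context demonstrations."""
    examples = list(examples)
    k = int(len(examples) * fraction)
    for i in rng.sample(range(len(examples)), k):
        subj, rel, _ = examples[i]
        _, _, wrong_obj = examples[(i + 1) % len(examples)]
        examples[i] = (subj, rel, wrong_obj)
    return examples

def accuracy(complete: Callable[[str], str], facts: List[Triplet],
             n_examples: int, noise_fraction: float = 0.0, seed: int = 0) -> float:
    """Fraction of facts whose object the model completes correctly, given
    n_examples (possibly corrupted) in-context demonstrations per query."""
    rng = random.Random(seed)
    correct = 0
    for subj, rel, obj in facts:
        demos = [f for f in facts if f[0] != subj][:n_examples]
        demos = corrupt(demos, noise_fraction, rng)
        prediction = complete(build_icl_query(demos, subj))  # model call (placeholder)
        correct += prediction.strip().startswith(obj)
    return correct / len(facts)
```

Sweeping `n_examples` and `noise_fraction` in a loop like this reproduces, in spirit, the sensitivity analysis described above.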

Results and Implications

The authors evaluated 49 open-source LLMs spanning several families and scales, including OPT, Pythia, and Llama(2), on a comprehensive set of relations and facts drawn from Wikidata. Certain model families, such as Llama2, Mistral, and Gemma, consistently outperformed others in estimated factual knowledge. Larger models generally retained more facts than smaller models within the same family, suggesting that increased parameter count corresponds to greater factual knowledge retention. Notably, the results indicate a reduction in surfaced latent knowledge after models undergo fine-tuning, which primarily targets instruction-following ability rather than factual recall.

Future Directions

This research opens the door to several future explorations. The ICL-based LKE could be adapted or extended to assess more complex types of factual knowledge, further delineating patterns of retention across diverse knowledge domains. The investigation into the effects of model architecture and training regimes on factual accuracy could also be expanded, providing deeper insight into optimizing LLM training for factual reliability.

The ability of LLMs to hold and recall complex real-world facts has broad implications for their deployment in AI systems that interface directly with humans. This paper's findings urge researchers to carefully consider the trade-offs involved in model fine-tuning and scaling, ensuring that enhancements in one aspect do not detract from another. As the field progresses, robust methodologies like the one proposed here offer vital contributions toward understanding and improving the utility and reliability of LLMs in factual applications.

Authors (10)
  1. Qinyuan Wu
  2. Mohammad Aflah Khan
  3. Soumi Das
  4. Vedant Nanda
  5. Bishwamittra Ghosh
  6. Camila Kolling
  7. Till Speicher
  8. Laurent Bindschaedler
  9. Krishna P. Gummadi
  10. Evimaria Terzi