
Certifying Knowledge Comprehension in LLMs (2402.15929v3)

Published 24 Feb 2024 in cs.AI, cs.CL, and cs.LG

Abstract: LLMs are increasingly deployed in safety-critical systems where they provide answers based on in-context information derived from knowledge bases. As LLMs are increasingly envisioned as superhuman agents, their proficiency in knowledge comprehension (extracting relevant information and reasoning over it to answer questions, a key facet of human intelligence) becomes crucial. However, existing evaluations of LLMs on knowledge comprehension are typically conducted on small test sets, which represent only a tiny fraction of the vast number of possible queries. Simple empirical evaluations on these limited test sets raise concerns about the reliability and generalizability of the results. In this work, we introduce the first specification and certification framework for knowledge comprehension in LLMs, providing formal probabilistic guarantees for reliability. Instead of a fixed dataset, we design novel specifications that mathematically represent prohibitively large probability distributions of knowledge comprehension prompts with natural noise, using knowledge graphs. From these specifications, we generate quantitative certificates that offer high-confidence, tight bounds on the probability that a given LLM correctly answers any question drawn from the specification distribution. We apply our framework to certify SOTA LLMs in two domains: precision medicine and general question-answering. Our results reveal previously unrecognized vulnerabilities in SOTA LLMs due to natural noise in the prompts. Additionally, we establish performance hierarchies with formal guarantees among the SOTA LLMs, particularly in the context of precision medicine question-answering.
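The certification idea sketched in the abstract — sample prompts from the specification distribution, score the model on each, and bound the true accuracy with high confidence — can be illustrated with a binomial confidence bound. The sketch below is an assumption-laden stand-in, not the paper's method: it uses a simple two-sided Hoeffding interval, whereas the paper's certificates use tighter binomial (Clopper-Pearson-style) bounds, and `mock_llm_trial` is a hypothetical placeholder for actually drawing a knowledge-graph prompt and querying an LLM.

```python
import math
import random

def certify_accuracy(sample_fn, n=1000, delta=0.05):
    """Return a (1 - delta)-confidence interval [lo, hi] on the probability
    that the model answers a prompt drawn from the specification
    distribution correctly.

    sample_fn() must draw one prompt, query the model, and return
    True (correct) or False (incorrect). Uses a Hoeffding bound:
    eps = sqrt(ln(2/delta) / (2n)); the paper's certificates use
    tighter Clopper-Pearson-style binomial intervals instead.
    """
    correct = sum(sample_fn() for _ in range(n))
    p_hat = correct / n
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# Hypothetical stand-in for sampling a knowledge-graph prompt and
# checking the LLM's answer; here we just simulate an 80%-accurate model.
rng = random.Random(0)
def mock_llm_trial():
    return rng.random() < 0.8

lo, hi = certify_accuracy(mock_llm_trial, n=2000, delta=0.05)
print(f"With 95% confidence, accuracy lies in [{lo:.3f}, {hi:.3f}]")
```

With 2000 samples the interval half-width is about 0.03, which is why the framework emphasizes sampling from a mathematically specified distribution rather than a fixed small test set: the guarantee holds over the entire (prohibitively large) prompt distribution, not just the prompts actually drawn.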

