EpiK-Eval: Evaluation for Language Models as Epistemic Models (2310.15372v2)

Published 23 Oct 2023 in cs.CL and cs.AI

Abstract: In the age of artificial intelligence, the role of LLMs is becoming increasingly central. Despite their growing prevalence, their capacity to consolidate knowledge from different training documents - a crucial ability in numerous applications - remains unexplored. This paper presents the first study examining the capability of LLMs to effectively combine such information within their parameter space. We introduce EpiK-Eval, a novel question-answering benchmark tailored to evaluate LLMs' proficiency in formulating a coherent and consistent knowledge representation from segmented narratives. Evaluations across various LLMs reveal significant weaknesses in this domain. We contend that these shortcomings stem from the intrinsic nature of prevailing training objectives. Consequently, we advocate for refining the approach towards knowledge consolidation, as it harbors the potential to dramatically improve their overall effectiveness and performance. The findings from this study offer insights for developing more robust and reliable LLMs. Our code and benchmark are available at https://github.com/chandar-lab/EpiK-Eval
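To make the benchmark's core idea concrete, below is a minimal, hypothetical sketch of an EpiK-Eval-style check: a short narrative is split into separate segments (standing in for distinct training documents), and a question is posed whose answer requires combining facts that never co-occur in any single segment. The segment texts, the question, and the answer_fn stub are illustrative assumptions, not the paper's actual data or evaluation code; see the official repository linked above for the real benchmark.

```python
from typing import Callable, List

# Each segment stands in for an independent document the model sees separately.
story_segments: List[str] = [
    "Part 1: On Monday, Ada borrowed the red key from the library.",
    "Part 2: On Tuesday, Ada gave the red key to Boris.",
    "Part 3: On Wednesday, Boris used the key he was given to open the archive.",
]

# Answering this requires consolidating facts across all three segments:
# the key Boris used is the red key that originated at the library.
question = "Where did the key that opened the archive originally come from?"
gold_answer = "the library"


def evaluate(answer_fn: Callable[[List[str], str], str]) -> bool:
    """Return True if the model's answer contains the gold answer (loose string match)."""
    prediction = answer_fn(story_segments, question)
    return gold_answer.lower() in prediction.lower()


if __name__ == "__main__":
    # Stub "model" that only attends to the final segment, mimicking a failure
    # to consolidate knowledge spread across documents.
    def last_segment_only(segments: List[str], q: str) -> str:
        return "the archive"  # plausible locally, wrong without cross-segment reasoning

    print("consolidation success:", evaluate(last_segment_only))  # -> False
```

In the paper's setting the segments are interleaved into the training corpus rather than passed in-context, so the probe targets what the model has consolidated in its parameters; the toy scorer above only illustrates the question-answering and matching step.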

Authors (5)
  1. Gabriele Prato
  2. Jerry Huang
  3. Prasanna Parthasarathi
  4. Shagun Sodhani
  5. Sarath Chandar
Citations (3)
