When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories (2212.10511v4)

Published 20 Dec 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 models and 4 augmentation methods on PopQA, our new open-domain QA dataset with 14k questions. We find that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of factual knowledge in the long tail. We then show that retrieval-augmented LMs largely outperform orders of magnitude larger LMs, while unassisted LMs remain competitive in questions about high-popularity entities. Based on those findings, we devise a simple, yet effective, method for powerful and efficient retrieval-augmented LMs, which retrieves non-parametric memories only when necessary. Experimental results show that this significantly improves models' performance while reducing the inference costs.

Investigating the Efficacy of Parametric and Non-Parametric Memories in LLMs

Recent research efforts have explored the capabilities and limitations of language models (LMs) in retaining and recalling factual knowledge. The paper "When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories" contributes to this line of inquiry by systematically probing LMs on their ability to memorize and retrieve factual knowledge across diverse subject entities and relationship types. The analysis involves comprehensive evaluations using a newly introduced dataset, PopQA, and the existing EntityQuestions dataset to assess LMs' parametric knowledge and their performance when augmented with non-parametric memory retrieval.
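The probing setup can be illustrated with a small sketch: PopQA-style questions are built by slotting knowledge-triple subjects into relation-specific templates. The templates below are illustrative stand-ins, not the paper's exact wording.

```python
# Illustrative relation-to-template mapping; the real PopQA templates differ.
TEMPLATES = {
    "occupation": "What is {subject}'s occupation?",
    "place of birth": "In what city was {subject} born?",
    "author": "Who is the author of {subject}?",
}

def triple_to_question(subject, relationship, obj):
    """Turn a (subject, relationship, object) knowledge triple into a QA pair."""
    question = TEMPLATES[relationship].format(subject=subject)
    return {"question": question, "answers": [obj]}
```

Keeping the relationship type explicit in each example is what later allows popularity thresholds to be tuned per relation.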

The paper shows that LMs memorize factual knowledge well when it pertains to popular, frequently encountered entities, but their performance drops considerably in the long tail of less popular entities. The research identifies a strong positive correlation between entity popularity and memorization accuracy: larger LMs, such as GPT-3, recall popular knowledge more reliably yet still struggle with rarer facts. This implies that while scaling enhances LMs' parametric memory for widely known facts, it does not notably improve performance on less commonly discussed entities.
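This popularity-accuracy relationship can be checked with a simple bucketing analysis. The sketch below assumes each result pairs a subject's popularity proxy (e.g. Wikipedia page views, the proxy used in the paper) with a correctness flag:

```python
import math
from collections import defaultdict

def accuracy_by_popularity(results, bucket_size=1.0):
    """Bucket QA results by log10 of subject popularity and report
    per-bucket accuracy. `results` is a list of (popularity, correct) pairs."""
    buckets = defaultdict(list)
    for popularity, correct in results:
        buckets[int(math.log10(popularity) // bucket_size)].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```

A strong parametric memory should show accuracy rising with the bucket index; a flat curve in the low buckets is the long-tail failure mode the paper documents.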

To address these limitations, the authors explore integrating retrieval-augmented techniques to complement the inherent weaknesses of parametric memory in LMs. The retrieval-augmented models leverage external non-parametric memories by incorporating retrieval mechanisms like BM25 and Contriever, and generate-then-read methods such as GenRead. The results are noteworthy: smaller models, when augmented with retrieval capabilities, can achieve accuracies surpassing much larger unassisted models. For instance, Contriever-augmented LMs outperform vanilla GPT-3 models, particularly in recalling less popular factual information.
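A minimal sketch of this retrieve-then-read pipeline, with a self-contained BM25 scorer standing in for a full BM25/Contriever index (the corpus, prompt format, and scoring constants here are illustrative assumptions):

```python
import math
from collections import Counter

def tokenize(text):
    return [t.strip("?.,!").lower() for t in text.split()]

def bm25_score(query_tokens, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score one document against a query with the standard BM25 formula."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        df = sum(1 for d in corpus_tokens if term in d)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        score += idf * freq * (k1 + 1) / (
            freq + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

def retrieve(question, corpus):
    """Return the highest-scoring passage for the question."""
    query = tokenize(question)
    docs = [tokenize(p) for p in corpus]
    scores = [bm25_score(query, d, docs) for d in docs]
    return corpus[max(range(len(corpus)), key=scores.__getitem__)]

def build_prompt(question, passage):
    """Prepend the retrieved passage so the LM can read rather than recall."""
    return f"Knowledge: {passage}\nQ: {question}\nA:"
```

The retrieved passage is simply concatenated into the prompt; the LM then answers from the provided context instead of relying on its parameters alone.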

However, the augmentation is not without its pitfalls. Retrieval can occasionally mislead LMs when the retrieved documents are incorrect or irrelevant, reducing accuracy on questions about popular entities that the models had already memorized correctly.

To mitigate this, the paper introduces the concept of Adaptive Retrieval—a dynamic strategy that selectively activates retrieval augmentation based on heuristically determined popularity thresholds for different relationship types. This approach capitalizes on the findings that popular knowledge is often well-memorized by parametric models, while less popular knowledge benefits from external retrieval. The Adaptive Retrieval method showcases improved performance and reduced inference costs, especially with larger models, underscoring its potential utility in practical implementations.
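Adaptive Retrieval reduces, in essence, to a per-relation popularity gate. A sketch under the assumption that a dev set records, for each question, whether the parametric-only and the retrieval-augmented answers were correct:

```python
def best_threshold(dev_examples, candidate_thresholds):
    """Pick the popularity threshold maximizing dev-set accuracy when
    questions below it are routed to retrieval and the rest to the LM alone."""
    def accuracy(t):
        hits = sum(ex["retrieval_correct"] if ex["popularity"] < t
                   else ex["parametric_correct"]
                   for ex in dev_examples)
        return hits / len(dev_examples)
    return max(candidate_thresholds, key=accuracy)

def adaptive_answer(question, popularity, relationship, thresholds,
                    parametric_answer, retrieval_answer):
    """Retrieve only for long-tail subjects; popular ones use parametric memory."""
    if popularity < thresholds.get(relationship, 0):
        return retrieval_answer(question)   # long tail: augment with retrieval
    return parametric_answer(question)      # popular: skip retrieval, cheaper
```

Because retrieval is skipped for popular entities, the gate both avoids distracting contexts where parametric memory already suffices and cuts inference cost.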

The implications of this research are significant for the future development of LLMs. It suggests a path towards more efficient models that judiciously blend parametric and non-parametric knowledge, enhancing both accuracy and efficiency. Furthermore, it prompts further exploration into more nuanced strategies for retrieval augmentation, potentially integrating more sophisticated calibration techniques with robust retrieval systems. As LMs continue to evolve, the insights from this paper provide a foundational understanding of optimizing knowledge retrieval to balance memory capabilities across the spectrum of factual information.

References (43)
  1. Nancy E Adams. 2015. Bloom’s taxonomy of cognitive learning objectives. Journal of the Medical Library Association.
  2. Evidentiality-guided generation for knowledge-intensive NLP tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  3. MS MARCO: A human generated machine reading comprehension dataset.
  4. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models.
  5. Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning.
  6. Language models are few-shot learners. In Advances in Neural Information Processing systems.
  7. Knowledgeable or educated guess? Revisiting language models as knowledge bases. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.
  8. Quantifying memorization across neural language models.
  9. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  10. PaLM: Scaling language modeling with pathways.
  11. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  12. You can’t pick your neighbors, or can you? When and how to rely on retrieval in the kNN-LM. In Findings of EMNLP.
  13. Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management.
  14. Entities as experts: Sparse memory access with entity supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
  15. Efficient nearest neighbor language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  16. Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguistics: EMNLP 2022.
  17. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
  18. Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
  19. Few-shot learning with retrieval augmented language models.
  20. TemporalWiki: A lifelong benchmark for training and evaluating ever-evolving language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  21. Language models (mostly) know what they know.
  22. Large language models struggle to learn long-tail knowledge.
  23. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
  24. RealTime QA: What’s the answer right now?
  25. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
  26. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.
  27. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems.
  28. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  29. Nonparametric masked language model.
  30. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022.
  31. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.
  32. E-BERT: Efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020.
  33. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.
  34. Impact of pretraining term frequencies on few-shot reasoning.
  35. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
  36. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval.
  37. Simple entity-centric questions challenge dense retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  38. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021.
  39. Recitation-augmented language models.
  40. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  41. Generate rather than retrieve: Large language models are strong context generators.
  42. GLM-130B: An open bilingual pre-trained model.
  43. OPT: Open pre-trained transformer language models.
Authors (6)
  1. Alex Mallen (10 papers)
  2. Akari Asai (35 papers)
  3. Victor Zhong (25 papers)
  4. Rajarshi Das (27 papers)
  5. Daniel Khashabi (83 papers)
  6. Hannaneh Hajishirzi (176 papers)
Citations (389)