Tug-of-War Between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models (2402.14409v1)

Published 22 Feb 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Retrieval-augmented language models (RALMs) have demonstrated significant potential in refining and expanding their internal memory by retrieving evidence from external sources. However, RALMs inevitably encounter knowledge conflicts when integrating their internal memory with external sources. Such conflicts can ensnare RALMs in a tug-of-war between knowledge sources, limiting their practical applicability. In this paper, we focus on exploring and resolving knowledge conflicts in RALMs. First, we present an evaluation framework for assessing knowledge conflicts across various dimensions. Then, we investigate the behavior and preferences of RALMs from two perspectives: (1) conflicts between internal memory and external sources: we find that stronger RALMs exhibit a Dunning-Kruger effect, persistently favoring their faulty internal memory even when correct evidence is provided; RALMs also show an availability bias toward common knowledge; (2) conflicts between truthful, irrelevant, and misleading evidence: we reveal that RALMs follow the principle of majority rule, leaning toward evidence that appears more frequently. Moreover, we find that RALMs exhibit confirmation bias, and are more willing to choose evidence that is consistent with their internal memory. To address the challenge of knowledge conflicts, we propose a method called Conflict-Disentangle Contrastive Decoding (CD2) to better calibrate the model's confidence. Experimental results demonstrate that CD2 can effectively resolve knowledge conflicts in RALMs.
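The abstract specifies CD2 only at a high level; its exact formulation is given in the paper body. As a rough illustration of the family of methods it belongs to, the sketch below shows the general contrastive-decoding recipe that context-aware approaches in this space build on: compare the model's next-token distribution conditioned on retrieved evidence against its distribution from internal memory alone, and amplify what the evidence contributes. This is a minimal sketch, not the authors' CD2 implementation; the prompt templates, the `alpha` weight, and the plausibility cutoff are all illustrative assumptions.

```python
# Sketch of contrastive decoding between an evidence-conditioned
# distribution and the model's internal-memory distribution.
# NOT the authors' CD2 code: the weighting and cutoff below are
# assumptions made for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def next_token_logits(prompt: str) -> torch.Tensor:
    """Logits over the vocabulary for the next token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    return model(ids).logits[0, -1]

@torch.no_grad()
def contrastive_step(question: str, evidence: str,
                     alpha: float = 1.0, plaus: float = 0.1) -> int:
    # Distribution with retrieved evidence in the context ...
    logits_ctx = next_token_logits(f"{evidence}\nQ: {question}\nA:")
    # ... versus the model answering from internal memory alone.
    logits_mem = next_token_logits(f"Q: {question}\nA:")

    # Plausibility mask: keep only tokens the evidence-conditioned
    # model itself finds reasonably likely, a standard contrastive-
    # decoding guard against rewarding low-probability junk tokens.
    p_ctx = logits_ctx.softmax(-1)
    mask = p_ctx >= plaus * p_ctx.max()

    # Amplify what the evidence adds on top of internal memory.
    scores = logits_ctx + alpha * (logits_ctx - logits_mem)
    scores[~mask] = float("-inf")
    return int(scores.argmax())
```

In this framing, `alpha` controls how strongly the decoder trusts the retrieved evidence over the model's parametric memory; the paper's contribution is calibrating that trade-off rather than fixing it by hand.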

Authors (8)
  1. Zhuoran Jin (23 papers)
  2. Pengfei Cao (39 papers)
  3. Yubo Chen (58 papers)
  4. Kang Liu (207 papers)
  5. Xiaojian Jiang (5 papers)
  6. Jiexin Xu (5 papers)
  7. Qiuxia Li (1 paper)
  8. Jun Zhao (469 papers)
Citations (30)