Knowledge Conflicts for LLMs: A Survey (2403.08319v2)

Published 13 Mar 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: This survey provides an in-depth analysis of knowledge conflicts for LLMs, highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.

Exploring Knowledge Conflicts in LLMs: Categorization, Causes, and Solutions

In the field of LLMs, knowledge conflicts are inevitable given the vast and diverse sources of information that feed these models. A systematic exploration of this area reveals the intricate challenges LLMs face in reconciling contradictions among the information they process. This survey examines the categories of knowledge conflict LLMs encounter, namely context-memory, inter-context, and intra-memory conflicts, each with its own triggers and behavioral outcomes. It further discusses the practical and theoretical implications of these conflicts for the trustworthiness and performance of LLMs and reviews the strategies currently devised to mitigate them.

Context-Memory Conflict

Context-memory conflict arises when an LLM's built-in (parametric) knowledge contradicts new external (contextual) information supplied at inference time. Its causes fall broadly into temporal misalignment and misinformation pollution, both of which challenge the trustworthiness and real-time accuracy of models. Analyses show that LLMs exhibit a varied yet consistent preference for knowledge that is semantically coherent, logical, and compelling, regardless of whether it originates from the context or from memory. Proposed solutions range from pre-hoc strategies such as fine-tuning and knowledge plug-ins, which adapt the model to prioritize contextual information, to post-hoc strategies such as predicting fact validity, which help the model discern and adjust based on the reliability of the information source.
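
The survey itself does not prescribe code; purely as an illustrative sketch (with hypothetical `ask_llm` and `answers_agree` helpers), one simple way to surface a context-memory conflict is to compare the model's closed-book answer with its context-conditioned answer:

```python
# A minimal sketch, not the survey's method: `ask_llm` and `answers_agree`
# are hypothetical stand-ins for a real LLM call and an answer-matching check.

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API or local model."""
    raise NotImplementedError

def answers_agree(a: str, b: str) -> bool:
    """Crude string-level agreement; a real system might use NLI or EM/F1."""
    return a.strip().lower() == b.strip().lower()

def detect_context_memory_conflict(question: str, context: str) -> bool:
    # Closed-book answer draws only on parametric memory.
    closed_book = ask_llm(f"Answer from your own knowledge: {question}")
    # Open-book answer is conditioned on the supplied context.
    open_book = ask_llm(
        "Answer using only the context below.\n"
        f"Context: {context}\nQuestion: {question}"
    )
    # Disagreement signals a potential context-memory conflict that a
    # post-hoc strategy (e.g., fact-validity prediction) could arbitrate.
    return not answers_agree(closed_book, open_book)
```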

Inter-Context Conflict

Inter-context conflict is marked by discrepancies among the external pieces of information retrieved by or fed into the model. Stemming predominantly from misinformation and outdated information, this form of conflict significantly degrades LLM performance, and models confronted with it tend to over-rely on parametric knowledge. Efforts to combat these conflicts include leveraging specialized models to detect and eliminate contradictions, and augmenting query strategies to make models more robust to conflicting evidence.
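
As a rough illustration of the contradiction-elimination idea rather than any specific system surveyed, the sketch below scores retrieved passages pairwise with a placeholder NLI hook (`contradiction_score`, a hypothetical stub) and keeps the least-contradicted ones:

```python
from itertools import combinations

def contradiction_score(passage_a: str, passage_b: str) -> float:
    """Hypothetical hook: return P(contradiction) from an NLI model.
    Stubbed to 0.0 here; plug in a real entailment classifier."""
    return 0.0

def filter_conflicting_passages(passages, threshold: float = 0.5):
    # Count, for each passage, how many others it contradicts.
    conflicts = {i: 0 for i in range(len(passages))}
    for i, j in combinations(range(len(passages)), 2):
        if contradiction_score(passages[i], passages[j]) > threshold:
            conflicts[i] += 1
            conflicts[j] += 1
    # Keep the passages involved in the fewest contradictions; a real system
    # might also weight sources by credibility or recency.
    min_conflicts = min(conflicts.values(), default=0)
    return [p for i, p in enumerate(passages) if conflicts[i] == min_conflicts]
```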

Intra-Memory Conflict

Intra-memory conflict refers to discrepancies within the LLM's parametric knowledge itself, which yield inconsistent outputs for semantically equivalent but syntactically different inputs. Such inconsistencies undermine the reliability and utility of LLMs across applications. The root causes are identified as biases in the training data, decoding strategies, and knowledge editing after deployment. Existing research focuses on improving the consistency and factuality of model responses, with proposed solutions spanning both the training phase and post-hoc processing, aiming to refine parametric knowledge and regulate model behavior.
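
To make the notion of inconsistency concrete, a minimal hypothetical probe can paraphrase a factual query, collect the model's answers, and report how often the majority answer appears; `ask_llm` is again a stand-in for a real model call:

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call; plug in a real API or local model."""
    raise NotImplementedError

def self_consistency(paraphrases) -> float:
    """Fraction of paraphrased queries whose answer matches the majority answer."""
    answers = [ask_llm(p).strip().lower() for p in paraphrases]
    if not answers:
        return 1.0
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# Semantically equivalent phrasings of the same factual query:
paraphrases = [
    "What is the capital of Australia?",
    "Which city is Australia's capital?",
    "Name the capital city of Australia.",
]
# score = self_consistency(paraphrases)  # 1.0 means fully consistent answers
```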

Challenges and Future Directions

The survey outlines several unaddressed challenges and potential research directions. A critical concern is the practicality of current solutions, which mostly address artificially constructed knowledge conflicts, highlighting the need for studies exploring conflicts "in the wild," especially in the context of retrieval-augmented systems. Furthermore, there's a call for deeper investigation into the interplay among different types of conflicts and their compounded effects on LLM behavior. The ethics and implications of handling misinformation, especially in sensitive applications, also remain an area ripe for exploration.

Conclusion

This survey provides a comprehensive overview of the current state of research on knowledge conflicts in LLMs, offering insights into conflict categorization, underlying causes, model behaviors, and resolution strategies. As LLMs continue to evolve and integrate into various aspects of technology and daily life, understanding and addressing knowledge conflicts will be paramount in ensuring their trustworthiness, reliability, and utility.

Authors (7)
  1. Rongwu Xu
  2. Zehan Qi
  3. Cunxiang Wang
  4. Hongru Wang
  5. Yue Zhang
  6. Wei Xu
  7. Zhijiang Guo
Citations (41)