
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity (2310.07521v3)

Published 11 Oct 2023 in cs.CL

Abstract: This survey addresses the crucial issue of factuality in LLMs. As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital. We define the Factuality Issue as the probability that LLMs produce content inconsistent with established facts. We first delve into the implications of these inaccuracies, highlighting the potential consequences and challenges posed by factual errors in LLM outputs. Subsequently, we analyze the mechanisms through which LLMs store and process facts, seeking the primary causes of factual errors. Our discussion then transitions to methodologies for evaluating LLM factuality, emphasizing key metrics, benchmarks, and studies. We further explore strategies for enhancing LLM factuality, including approaches tailored for specific domains. We focus on two primary LLM configurations, standalone LLMs and retrieval-augmented LLMs that utilize external data, and detail their unique challenges and potential enhancements. Our survey offers a structured guide for researchers aiming to fortify the factual reliability of LLMs.

Survey on Factuality in LLMs: Knowledge, Retrieval and Domain-Specificity

The paper "Survey on Factuality in LLMs: Knowledge, Retrieval and Domain-Specificity" is a comprehensive examination of the factual reliability of LLMs. As LLMs become integral to various applications, ensuring their output is factually accurate is crucial. This paper systematically explores the concerns regarding factuality in LLMs, presenting a detailed analysis of the mechanisms and strategies involved in enhancing their factual accuracy.

The research discusses the "factuality issue," defined as the likelihood of LLMs generating content inconsistent with established facts. It highlights the implications of these inaccuracies, shedding light on the potential challenges and consequences posed by factual errors in LLM-generated outputs. The authors provide a structured examination of methodologies for evaluating LLM factuality, placing emphasis on key metrics, benchmarks, and recent studies. Various strategies for improving factual accuracy, particularly through domain-specific approaches, are discussed.
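Many of the evaluation methodologies surveyed ultimately reduce to scoring model outputs against reference answers from a benchmark. As a minimal, hypothetical illustration (the function names, normalization scheme, and toy benchmark below are assumptions for exposition, not the paper's own protocol), a simple factual-accuracy metric over QA pairs might look like:

```python
def normalize(text: str) -> str:
    """Lowercase and drop punctuation so matching is lenient."""
    return "".join(ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())

def factual_accuracy(model_fn, benchmark) -> float:
    """Fraction of questions whose model answer contains the reference fact."""
    hits = 0
    for question, reference in benchmark:
        answer = normalize(model_fn(question))
        if normalize(reference) in answer:
            hits += 1
    return hits / len(benchmark)

# Toy benchmark and a stand-in "model", for demonstration only.
toy_benchmark = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "Shakespeare"),
]
toy_model = lambda q: "Paris is the capital." if "France" in q else "I am not sure."

print(factual_accuracy(toy_model, toy_benchmark))  # 0.5
```

Real benchmarks discussed in the survey (e.g., TruthfulQA, Natural Questions) pair such string-matching scores with human or model-based judgments, since containment checks miss paraphrases and reward lucky substrings.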

The paper scrutinizes two main LLM configurations: standalone LLMs and retrieval-augmented LLMs. Standalone LLMs operate independently without external data inputs, whereas retrieval-augmented versions harness external data to refine their outputs. Each configuration comes with its set of challenges and opportunities for enhancement. The paper systematically reviews methods aimed at improving the factuality of LLMs in both settings, providing a valuable resource for researchers aiming to enhance the reliability of LLMs.
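The retrieval-augmented configuration can be sketched as a pipeline that scores an external corpus against the user query and prepends the best-matching passages to the prompt before generation. The sketch below uses naive keyword overlap for ranking; the corpus, scoring function, and prompt template are illustrative assumptions (production systems typically use dense embedding retrievers), not the paper's specific method:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query; real systems use dense embeddings."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: len(q_words & set(p.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved evidence so the LLM can ground its answer in external data."""
    evidence = "\n".join(retrieve(query, corpus))
    return f"Context:\n{evidence}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "Paris is the capital of France.",
]
prompt = build_prompt("Where is the Eiffel Tower located?", corpus)
print(prompt)
```

A standalone LLM answers the question from its parametric memory alone; the retrieval-augmented variant answers from the constructed prompt, which shifts the factuality problem toward retrieval quality and the model's faithfulness to the supplied context.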

The authors also focus on domain-specific LLM applications. These involve tailoring LLMs to specific domains, such as medicine, finance, and law, where factual accuracy is particularly critical. The survey discusses various domain-specific enhancements that improve the factual reliability of LLMs, offering insight into how these tailored solutions can lead to more accurate and dependable model outputs in specialized fields.

Beyond individual methodologies, the survey emphasizes the importance of a holistic approach to addressing factual inaccuracies in LLMs. By synthesizing research from different domains and approaches, it provides a cohesive guide for fortifying the factual reliability of these models, ensuring they can serve as reliable tools in various academic and practical applications.

In conclusion, this paper serves as a resource for understanding and enhancing the factual accuracy of LLMs. It addresses a pivotal issue in the broader application of AI, providing researchers with the necessary insights and methodologies to develop more factually accurate models. The survey's comprehensive approach to evaluating, analyzing, and improving LLM factuality exemplifies the collaborative efforts required to ensure the ongoing utility and trustworthiness of these advanced computational tools.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (312)
  1. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 3554–3565. https://doi.org/10.18653/v1/2021.naacl-main.278
  2. A Review on Language Models as Knowledge Bases. arXiv:2204.06031 [cs.CL]
  3. Falcon-40B: an open large language model with state-of-the-art performance. (2023).
  4. The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics, Dominican Republic, 1–13. https://doi.org/10.18653/v1/2021.fever-1.1
  5. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv preprint arXiv:2310.11511 (2023). https://arxiv.org/abs/2310.11511
  6. Amos Azaria and Tom Mitchell. 2023. The internal state of an llm knows when its lying. arXiv preprint arXiv:2304.13734 (2023).
  7. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. arXiv:2306.04136 [cs.CL]
  8. HouYi: An open-source large language model specially designed for renewable energy and carbon neutrality field. arXiv preprint arXiv:2308.01414 (2023).
  9. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
  10. DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation. arXiv preprint arXiv:2308.14346 (2023).
  11. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
  12. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 1533–1544. https://aclanthology.org/D13-1160
  13. The Reversal Curse: LLMs trained on ”A is B” fail to learn ”B is A”. arXiv:2309.12288 [cs.CL]
  14. OceanGPT: A Large Language Model for Ocean Science Tasks. arXiv:2310.02031 [cs.CL]
  15. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715 If you use this software, please cite it using these metadata..
  16. Improving language models by retrieving from trillions of tokens. arXiv:2112.04426 [cs.CL]
  17. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
  18. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712 [cs.CL]
  19. Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 3340–3354. https://doi.org/10.18653/v1/2022.acl-long.236
  20. Evaluation of Text Generation: A Survey. In arXiv preprint arXiv:2006.14799.
  21. A Survey on Evaluation of Large Language Models. arXiv preprint arXiv:2307.03109 (2023).
  22. Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain
  23. PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions. arXiv preprint arXiv:2305.14908 (2023).
  24. Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2292–2307.
  25. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv:2309.01431 [cs.CL]
  26. Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. arXiv preprint arXiv:2310.07289 (2023).
  27. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
  28. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 7870–7881.
  29. Open Question Answering over Tables and Text. arXiv:2010.10439 [cs.CL]
  30. Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons. arXiv:2308.13198 [cs.CL]
  31. Phoenix: Democratizing ChatGPT across Languages. arXiv preprint arXiv:2304.10453 (2023).
  32. FacTool: Factuality Detection in Generative AI – A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. arXiv:2307.13528 [cs.CL]
  33. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
  34. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
  35. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. arXiv preprint arXiv:2309.03883 (2023).
  36. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416 [cs.LG]
  37. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI]
  38. Claude. 2023. Introducing Claude. https://www.anthropic.com/index/introducing-claude
  39. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
  40. Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement 20, 1 (1960), 37–46.
  41. Evaluating the Ripple Effects of Knowledge Editing in Language Models. arXiv:2307.12976 [cs.CL]
  42. LM vs LM: Detecting Factual Errors via Cross Examination. arXiv:2305.13281 [cs.CL]
  43. Together Computer. 2023. RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset. https://github.com/togethercomputer/RedPajama-Data
  44. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092 (2023).
  45. ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. arXiv:2306.16092 [cs.CL]
  46. Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca. arXiv preprint arXiv:2304.08177 (2023). https://arxiv.org/abs/2304.08177
  47. Curation. 2020. Curation Corpus Base.
  48. Hallucination is the last thing you need. arXiv:2306.11520 [cs.CL]
  49. Knowledge Neurons in Pretrained Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8493–8502. https://doi.org/10.18653/v1/2022.acl-long.581
  50. Editing Factual Knowledge in Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 6491–6506. https://doi.org/10.18653/v1/2021.emnlp-main.522
  51. Mention Memory: incorporating textual knowledge into Transformers through entity mention attention. In International Conference on Learning Representations. https://openreview.net/forum?id=OY1A8ejQgEX
  52. Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. arXiv preprint arXiv:2306.05064 (2023).
  53. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  54. Chain-of-Verification Reduces Hallucination in Large Language Models. arXiv preprint arXiv:2309.11495 (2023).
  55. Wizard of Wikipedia: Knowledge-Powered Conversational agents. arXiv:1811.01241 [cs.CL]
  56. Calibrating Factual Knowledge in Pretrained Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5937–5947. https://doi.org/10.18653/v1/2022.findings-emnlp.438
  57. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL]
  58. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325 [cs.CL]
  59. Fool Me Twice: Entailment from Wikipedia Gamification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 352–365. https://doi.org/10.18653/v1/2021.naacl-main.32
  60. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan.
  61. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217 [cs.CL]
  62. Scaling Language Models: Methods, Analysis & Insights from Training Gopher.
  63. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 3558–3567. https://doi.org/10.18653/v1/P19-1346
  64. GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning. arXiv preprint arXiv:2307.13923 (2023).
  65. Robert M Fano and David Hawkins. 1961. Transmission of information: A statistical theory of communications. American Journal of Physics 29, 11 (1961), 793–794.
  66. LawBench: Benchmarking Legal Knowledge of Large Language Models. arXiv preprint arXiv:2309.16289 (2023).
  67. WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9126–9140.
  68. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 1137–1147. https://doi.org/10.18653/v1/2020.emnlp-main.86
  69. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166 (2023).
  70. Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs. arXiv:2311.00681 [cs.CL]
  71. Entities as Experts: Sparse Memory Access with Entity Supervision. arXiv:2004.07202 [cs.CL]
  72. Bias and Fairness in Large Language Models: A Survey. arXiv preprint arXiv:2309.00770 (2023).
  73. Grounded response generation task at dstc7. In AAAI Dialog System Technology Challenges Workshop.
  74. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs.CL]
  75. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 16477–16508.
  76. Enabling Large Language Models to Generate Text with Citations. arXiv:2305.14627 [cs.CL]
  77. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3356–3369. https://doi.org/10.18653/v1/2020.findings-emnlp.301
  78. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. CoRR abs/2304.14767 (2023). https://doi.org/10.48550/arXiv.2304.14767 arXiv:2304.14767
  79. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 30–45. https://aclanthology.org/2022.emnlp-main.3
  80. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics 9 (2021), 346–361. https://doi.org/10.1162/tacl_a_00370
  81. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 9 (2021), 346–361.
  82. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 5484–5495. https://doi.org/10.18653/v1/2021.emnlp-main.446
  83. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. arXiv:1312.6211 [stat.ML]
  84. Google. 2023. Bard. bard.google.com (2023).
  85. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv:2305.11738 [cs.CL]
  86. Cohortgpt: An enhanced gpt for participant recruitment in clinical study. arXiv preprint arXiv:2307.11346 (2023).
  87. Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462 (2023).
  88. INFOTABS: Inference on Tables as Semi-structured Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 2309–2324. https://doi.org/10.18653/v1/2020.acl-main.210
  89. INFOTABS: Inference on tables as semi-structured data. arXiv preprint arXiv:2005.06117 (2020).
  90. WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Transactions of the Association for Computational Linguistics 9 (2021), 211–225. https://doi.org/10.1162/tacl_a_00362
  91. Rethinking with Retrieval: Faithful Large Language Model Inference. arXiv:2301.00303 [cs.CL]
  92. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  93. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: A pilot study. International journal of environmental research and public health 20, 4 (2023), 3378.
  94. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 6609–6625. https://doi.org/10.18653/v1/2020.coling-main.580
  95. MISGENDERED: Limits of Large Language Models in Understanding Pronouns. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 5352–5367. https://doi.org/10.18653/v1/2023.acl-long.293
  96. Wenpin Hou and Zhicheng Ji. 2023. GeneTuring tests GPT models in genomics. bioRxiv (2023).
  97. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751 [cs.LG]
  98. Do Large Language Models Know about Facts? arXiv preprint arXiv:2310.05177 (2023).
  99. BSChecker for Fine-grained Hallucination Detection. (2023). https://github.com/amazon-science/bschecker-for-fine-grained-hallucination-detection
  100. Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards Reasoning in Large Language Models: A Survey. arXiv preprint arXiv:2212.10403 (2022).
  101. Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. arXiv:2212.10403 [cs.CL]
  102. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ArXiv abs/2311.05232 (2023). https://api.semanticscholar.org/CorpusID:265067168
  103. Lawyer LLaMA Technical Report. arXiv preprint arXiv:2305.15062 (2023).
  104. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv preprint arXiv:2305.08322 (2023).
  105. Transformer-Patcher: One Mistake Worth One Neuron. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=4oYUGeGBPm
  106. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 (2021).
  107. Gautier Izacard and Edouard Grave. 2021a. Distilling Knowledge from Reader to Retriever for Question Answering. In ICLR 2021 - 9th International Conference on Learning Representations. Vienna, Austria.
  108. Gautier Izacard and Edouard Grave. 2021b. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 874–880. https://doi.org/10.18653/v1/2021.eacl-main.74
  109. Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv:2208.03299 [cs.CL]
  110. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 12, Article 248 (mar 2023), 38 pages. https://doi.org/10.1145/3571730
  111. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
  112. TempQuestions: A Benchmark for Temporal Question Answering. In Companion Proceedings of the The Web Conference 2018 (Lyon, France) (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1057–1062. https://doi.org/10.1145/3184558.3191536
  113. StructGPT: A general framework for Large Language Model to Reason on Structured Data. arXiv preprint arXiv:2305.09645. https://arxiv.org/pdf/2305.09645.pdf
  114. FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 318–323. https://doi.org/10.18653/v1/N19-1028
  115. HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3441–3460. https://doi.org/10.18653/v1/2020.findings-emnlp.309
  116. Active Retrieval Augmented Generation. arXiv:2305.06983 [cs.CL]
  117. Genegpt: Augmenting large language models with domain tools for improved access to biomedical information. ArXiv (2023).
  118. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada.
  119. Language Models (Mostly) Know What They Know. arXiv:2207.05221 [cs.CL]
  120. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning. PMLR, 15696–15707.
  121. KALA: Knowledge-Augmented Language Model Adaptation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 5144–5167. https://doi.org/10.18653/v1/2022.naacl-main.379
  122. RealTime QA: What’s the Answer Right Now? https://arxiv.org/abs/2207.13332
  123. Hey AI, Can You Solve Complex Tasks by Talking to Agents? arXiv:2110.08542 [cs.CL]
  124. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. arXiv:2210.02406 [cs.CL]
  125. Understanding Catastrophic Forgetting in Language Models via Implicit Inference. arXiv:2309.10105 [cs.CL]
  126. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS digital health 2, 2 (2023), e0000198.
  127. Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association of Computational Linguistics (2019).
  128. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv:2203.05115 [cs.CL]
  129. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 1203–1213. https://doi.org/10.18653/v1/D16-1128
  130. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8424–8445. https://doi.org/10.18653/v1/2022.acl-long.577
  131. Factuality enhanced language models for open-ended text generation. Advances in Neural Information Processing Systems 35 (2022), 34586–34599.
  132. Douglas B. Lenat. 1995. CYC: A Large-Scale Investment in Knowledge Infrastructure. Commun. ACM 38, 11 (nov 1995), 33–38. https://doi.org/10.1145/219717.219745
  133. Zero-Shot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, Vancouver, Canada, 333–342. https://doi.org/10.18653/v1/K17-1034
  134. Large Language Models with Controllable Working Memory. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 1774–1793. https://doi.org/10.18653/v1/2023.findings-acl.112
  135. MultiSpanQA: A Dataset for Multi-Span Question Answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 1250–1260. https://doi.org/10.18653/v1/2022.naacl-main.90
  136. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv:2305.11747 [cs.CL]
  137. Huatuo-26M, a Large-scale Chinese Medical QA Dataset. arXiv preprint arXiv:2305.01526 (2023).
  138. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. arXiv:2306.03341 [cs.LG]
  139. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097 (2022).
  140. A Survey on Truth Discovery. SIGKDD Explor. 17, 2 (2015), 1–16. https://doi.org/10.1145/2897350.2897352
  141. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 15, 6 (2023).
  142. EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. arXiv preprint arXiv:2308.06966 (2023).
  143. Decoupled Context Processing for Context Augmented Language Modeling. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). https://openreview.net/forum?id=02dbnEbEFn
  144. Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out.
  145. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334 (2022).
  146. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 3214–3252. https://doi.org/10.18653/v1/2022.acl-long.229
  147. Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models. arXiv preprint arXiv:2305.13711 (2023).
  148. Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey. arXiv:2305.18703 [cs.CL]
  149. Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models. arXiv preprint arXiv:2305.18703 (2023).
  150. We’re Afraid Language Models Aren’t Modeling Ambiguity. arXiv preprint arXiv:2304.14399 (2023).
  151. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning. arXiv:2306.14565 [cs.CV]
  152. Jerry Liu. 2022. LlamaIndex. https://doi.org/10.5281/zenodo.1234
  153. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172 (2023).
  154. Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 4140–4170. https://doi.org/10.18653/v1/2023.acl-long.228
  155. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
  156. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv preprint arXiv:2303.11032 (2023).
  157. MolXPT: Wrapping Molecules with Text for Generative Pre-training. arXiv preprint arXiv:2305.10688 (2023).
  158. Entity-Based Knowledge Conflicts in Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 7052–7063.
  159. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
  160. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
  161. A rigorous study of integrated gradients method and extensions to internal neuron attributions. In International Conference on Machine Learning. PMLR, 14485–14508.
  162. SAIL: Search-Augmented Instruction Learning. arXiv preprint arXiv:2305.15225 (2023).
  163. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics 23, 6 (2022), bbac409.
  164. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. arXiv:2308.08747 [cs.CL]
  165. Stable Beluga models. [https://huggingface.co/stabilityai/StableBeluga2](https://huggingface.co/stabilityai/StableBeluga2)
  166. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 9802–9822. https://doi.org/10.18653/v1/2023.acl-long.546
  167. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896 [cs.CL]
  168. How Decoding Strategies Affect the Verifiability of Generated Text. In Findings of the Association for Computational Linguistics: EMNLP 2020. 223–235.
  169. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1906–1919. https://doi.org/10.18653/v1/2020.acl-main.173
  170. A PROPOSAL FOR THE DARTMOUTH SUMMER RESEARCH PROJECT ON ARTIFICIAL INTELLIGENCE. http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html. http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
  171. Alan Melikdjanian. 2018. Captain Disillusion's Escape from the USSR. https://www.youtube.com/watch?v=MaDz0FCxzR8 (Feb 2018).
  172. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems 35 (2022). https://openreview.net/forum?id=-h6WAS6eE4
  173. Teaching language models to support answers with verified quotes. arXiv:2203.11147 [cs.CL]
  174. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016).
  175. Microsoft. 2023. Bing Chat. https://www.bing.com/new (2023).
  176. George A. Miller. 1992. WordNet: A Lexical Database for English. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992.
  177. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography 3, 4 (12 1990), 235–244. https://doi.org/10.1093/ijl/3.4.235
  178. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. arXiv preprint arXiv:2305.14251 (2023). https://arxiv.org/abs/2305.14251
  179. AmbigQA: Answering Ambiguous Open-domain Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 5783–5797. https://doi.org/10.18653/v1/2020.emnlp-main.466
  180. Fast Model Editing at Scale. In International Conference on Learning Representations. https://openreview.net/forum?id=0DcZxeWfOPt
  181. Memory-Based Model Editing at Scale. In International Conference on Machine Learning.
  182. SKILL: Structured Knowledge Infusion for Large Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 1581–1588. https://doi.org/10.18653/v1/2022.naacl-main.113
  183. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786 (2022).
  184. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332 [cs.CL]
  185. DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 10056–10070. https://doi.org/10.18653/v1/2023.acl-long.559
  186. Allen Newell and Herbert A. Simon. 1976. Computer Science as Empirical Inquiry: Symbols and Search. Commun. ACM 19, 3 (mar 1976), 113–126. https://doi.org/10.1145/360018.360022
  187. Ha-Thanh Nguyen. 2023. A Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3. arXiv preprint arXiv:2302.05729 (2023).
  188. Capabilities of GPT-4 on Medical Challenge Problems. arXiv:2303.13375 [cs.CL]
  189. OpenAI. 2022a. GPT-3.5 - OpenAI. https://platform.openai.com/docs/models/gpt-3-5 (2022).
  190. OpenAI. 2022b. Introducing chatgpt. https://openai.com/blog/chatgpt (2022).
  191. OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  192. Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. arXiv:2312.05934 [cs.AI]
  193. Fact-Checking Complex Claims with Program-Guided Reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 6981–7004. https://doi.org/10.18653/v1/2023.acl-long.386
  194. Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv:2306.08302 [cs.CL]
  195. On the Risk of Misinformation Pollution with Large Language Models. arXiv preprint arXiv:2305.13661 (2023).
  196. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 1525–1534. https://doi.org/10.18653/v1/P16-1144
  197. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
  198. Daniel Park. 2023. Open-LLM-Leaderboard-Report. https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report
  199. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. arXiv:2302.12813 [cs.CL]
  200. KILT: a Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 2523–2544. https://doi.org/10.18653/v1/2021.naacl-main.200
  201. Language Models as Knowledge Bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 2463–2473. https://doi.org/10.18653/v1/D19-1250
  202. Language Models as Knowledge Bases? arXiv:1909.01066 [cs.CL]
  203. Pouya Pezeshkpour. 2023. Measuring and Modifying Factual Knowledge in Large Language Models. arXiv:2306.06264 [cs.CL]
  204. Pranshu Verma and Will Oremus. 2023. ChatGPT invented a sexual harassment scandal and named a real law prof as the accused. The Washington Post (2023). https://www.washingtonpost.com/technology/2023/04/05/chatgpt-lies/
  205. Summarization is (Almost) Dead. arXiv:2309.09558 [cs.CL]
  206. FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt. arXiv preprint arXiv:2308.10173 (2023).
  207. WebCPM: Interactive Web Search for Chinese Long-form Question Answering. arXiv:2305.06849 [cs.CL]
  208. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  209. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290 [cs.LG]
  210. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
  211. Measuring attribution in natural language generation models. Computational Linguistics (2023), 1–66.
  212. A Survey of Hallucination in Large Foundation Models. arXiv:2309.05922 [cs.AI]
  213. Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. arXiv:2307.11019 [cs.CL]
  214. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence 4, 12 (2022), 1256–1264.
  215. Unsupervised Improvement of Factual Knowledge in Language Models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Dubrovnik, Croatia, 2960–2969. https://doi.org/10.18653/v1/2023.eacl-main.215
  216. Leo Sands. 2023. ChatGPT falsely told voters their mayor was jailed for bribery. He may sue. The Washington Post (2023). https://www.washingtonpost.com/technology/2023/04/06/chatgpt-australia-mayor-lawsuit-lies/
  217. Explaining Legal Concepts with Augmented Large Language Models (GPT-4). arXiv preprint arXiv:2306.09525 (2023).
  218. BLOOM: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022).
  219. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  220. Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering. In Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 1604–1619. https://aclanthology.org/2022.coling-1.138
  221. When flue meets flang: Benchmarks and large pre-trained language model for financial domain. arXiv preprint arXiv:2211.00083 (2022).
  222. ChatGPT and other large language models are double-edged swords. Radiology 307, 2 (2023), e230163.
  223. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning. PMLR, 31210–31227.
  224. REPLUG: Retrieval-Augmented Black-Box Language Models. arXiv:2301.12652 [cs.CL]
  225. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI]
  226. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
  227. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv:2010.03768 [cs.CL]
  228. End-to-end training of multi-document reader and retriever for open-domain question answering. Advances in Neural Information Processing Systems 34 (2021), 25968–25981.
  229. Large language models encode clinical knowledge. Nature (2023), 1–9.
  230. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems 33 (2020), 16857–16867.
  231. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI conference on artificial intelligence, Vol. 31.
  232. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=uyTL5Bvosj
  233. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 8273–8288. https://aclanthology.org/2022.emnlp-main.566
  234. BeamSearchQA: Large Language Models are Strong Zero-Shot QA Solver. arXiv:2305.14766 [cs.CL]
  235. Head-to-Tail: How Knowledgeable are Large Language Models (LLM)? AKA Will LLMs Replace Knowledge Graphs? arXiv preprint arXiv:2308.10168 (2023).
  236. Contrastive learning reduces hallucination in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 13618–13626.
  237. Aligning Large Multimodal Models with Factually Augmented RLHF. arXiv:2309.14525 [cs.CV]
  238. Evaluating the Factual Consistency of Large Language Models Through News Summarization. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 5220–5255. https://aclanthology.org/2023.findings-acl.322
  239. MedChatZH: a Better Medical Adviser Learns from Better Instructions. arXiv preprint arXiv:2309.01114 (2023).
  240. Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family. arXiv:2303.07992 [cs.CL]
  241. Aligning Factual Consistency for Clinical Studies Summarization through Reinforcement Learning. In Proceedings of the 5th Clinical Natural Language Processing Workshop. Association for Computational Linguistics, Toronto, Canada, 48–58. https://doi.org/10.18653/v1/2023.clinicalnlp-1.7
  242. CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 5657–5668. https://doi.org/10.18653/v1/2022.naacl-main.415
  243. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
  244. InternLM Team. 2023a. Internlm: A multilingual language model with progressively enhanced capabilities.
  245. MosaicML NLP Team. 2023b. Introducing MPT-30B: Raising the bar for open-source foundation models. www.mosaicml.com/blog/mpt-30b Accessed: 2023-06-22.
  246. Large language models in medicine. Nature medicine (2023), 1–11.
  247. James Thorne and Andreas Vlachos. 2021. Evidence-based Factual Error Correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 3298–3309.
  248. FEVER: a large-scale dataset for Fact Extraction and VERification. arXiv:1803.05355 [cs.CL]
  249. Together. 2023. Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models. https://www.together.xyz/blog/redpajama-models-v1.
  250. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
  251. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
  252. MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics 10 (2022), 539–554. https://doi.org/10.1162/tacl_a_00475
  253. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 10014–10037. https://doi.org/10.18653/v1/2023.acl-long.557
  254. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. arXiv:2307.03987 [cs.CL]
  255. BioMedLM: a domain-specific large language model for biomedical text. MosaicML (2022). Accessed: Dec 23, 2022.
  256. Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
  257. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. arXiv:2310.03214 [cs.CL]
  258. G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks. arXiv:2212.03613 [cs.CL]
  259. Evaluating Open Question Answering Evaluation. arXiv:2305.12421 [cs.CL]
  260. Can Generative Pre-trained Language Models Serve As Knowledge Bases for Closed-book QA?. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 3241–3251. https://doi.org/10.18653/v1/2021.acl-long.251
  261. Knowledgeable Salient Span Mask for Enhancing Language Models as Knowledge Base. In Natural Language Processing and Chinese Computing, Fei Liu, Nan Duan, Qingting Xu, and Yu Hong (Eds.). Springer Nature Switzerland, Cham, 444–456.
  262. RFiD: Towards Rational Fusion-in-Decoder for Open-Domain Question Answering. arXiv:2305.17041 [cs.CL]
  263. HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. arXiv:2304.06975 [cs.CL]
  264. CMB: A Comprehensive Medical Benchmark in Chinese. arXiv preprint arXiv:2308.08833 (2023).
  265. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL]
  266. Resolving Knowledge Conflicts in Large Language Models. arXiv preprint arXiv:2310.00935 (2023).
  267. Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. arXiv preprint arXiv:2309.02233 (2023).
  268. Preserving In-Context Learning ability in Large Language Model Fine-tuning. arXiv preprint arXiv:2211.00635 (2022).
  269. Aligning Large Language Models with Human: A Survey. arXiv preprint arXiv:2307.12966 (2023).
  270. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). https://openreview.net/forum?id=_VjQlMeSB_J
  271. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics 6 (05 2018), 287–302. https://doi.org/10.1162/tacl_a_00021
  272. ”According to …” Prompting Language Models Improves Quoting from Pre-Training Data. arXiv:2305.13252 [cs.CL]
  273. ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation. arXiv preprint arXiv:2307.15290 (2023).
  274. BloombergGPT: A Large Language Model for Finance. arXiv:2303.17564 [cs.LG]
  275. Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Clashes. arXiv:2305.13300 [cs.CL]
  276. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. arXiv preprint arXiv:2306.05443 (2023).
  277. DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task. arXiv:2304.01097 [cs.CL]
  278. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097 (2023).
  279. Ming Xu. 2023. MedicalGPT: Training Medical GPT Model. https://github.com/shibing624/MedicalGPT.
  280. Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment. arXiv:2310.08372 [cs.CL]
  281. Baichuan 2: Open Large-scale Language Models. arXiv preprint arXiv:2309.10305 (2023).
  282. ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling. arXiv:2306.11489 [cs.CL]
  283. Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue. arXiv preprint arXiv:2308.03549 (2023).
  284. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv:1809.09600 [cs.CL]
  285. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL]
  286. Editing Large Language Models: Problems, Methods, and Opportunities. arXiv:2305.13172 [cs.CL]
  287. Cognitive Mirage: A Review of Hallucinations in Large Language Models. arXiv:2309.06794 [cs.CL]
  288. Do Large Language Models Know What They Don’t Know?. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 8653–8665. https://doi.org/10.18653/v1/2023.findings-acl.551
  289. Generate rather than retrieve: Large language models are strong context generators. In International Conference for Learning Representation (ICLR).
  290. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems 34 (2021), 27263–27277.
  291. Automatic Evaluation of Attribution by Large Language Models. arXiv preprint arXiv:2305.06311 (2023).
  292. Almanac: Retrieval-Augmented Language Models for Clinical Medicine. arXiv:2303.01229 [cs.CL]
  293. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022).
  294. Investigating the Catastrophic Forgetting in Multimodal Large Language Models. arXiv preprint arXiv:2309.10313 (2023).
  295. HuatuoGPT, towards Taming Language Model to Be a Doctor. arXiv preprint arXiv:2305.15075 (2023).
  296. Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence. CoRR abs/2209.02970 (2022).
  297. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534 (2023).
  298. Applications of transformer-based language models in bioinformatics: a survey. Bioinformatics Advances 3, 1 (01 2023), vbad001. https://doi.org/10.1093/bioadv/vbad001
  299. Mitigating Language Model Hallucination with Interactive Question-Knowledge Alignment. arXiv:2305.13669 [cs.CL]
  300. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
  301. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
  302. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations. https://openreview.net/forum?id=SkeHuCVFDr
  303. Interpretable Unified Language Checking. arXiv:2304.03728 [cs.CL]
  304. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv:2309.01219 [cs.CL]
  305. Plug-and-Play Knowledge Injection for Pre-trained Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 10641–10658. https://aclanthology.org/2023.acl-long.594
  306. ”What do others think?”: Task-Oriented Conversational Modeling with Subjective Knowledge. arXiv:2305.12091 [cs.CL]
  307. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223 (2023). http://arxiv.org/abs/2303.18223
  308. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364 [cs.CL]
  309. MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions. arXiv:2305.14795 [cs.CL]
  310. LIMA: Less Is More for Alignment. arXiv:2305.11206 [cs.CL]
  311. A Dataset for Document Grounded Conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  312. Context-faithful Prompting for Large Language Models. arXiv preprint arXiv:2303.11315 (2023).
Authors (16)
  1. Cunxiang Wang
  2. Xiaoze Liu
  3. Yuanhao Yue
  4. Xiangru Tang
  5. Tianhang Zhang
  6. Cheng Jiayang
  7. Yunzhi Yao
  8. Wenyang Gao
  9. Xuming Hu
  10. Zehan Qi
  11. Yidong Wang
  12. Linyi Yang
  13. Jindong Wang
  14. Xing Xie
  15. Zheng Zhang
  16. Yue Zhang
Citations (131)