HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild (2403.04307v3)
Abstract: Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging user queries (adversarially filtered with Alpaca) from ShareGPT, an existing dataset of real-world user-LLM interactions, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, enabling a fine-grained analysis of the kinds of hallucinations LLMs exhibit, and we synthesize reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach toward understanding and improving LLM reliability in scenarios reflective of real-world interactions. The benchmark is available at https://github.com/HaluEval-Wild/HaluEval-Wild.
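The construction pipeline described in the abstract can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' released code: the model-calling and judging helpers (`alpaca`, `judge`, `model_under_test`) are hypothetical placeholders for whatever inference and evaluation backends one plugs in, and the filtering criterion (keep a query if the weaker Alpaca model hallucinates on it) follows the adversarial-filtering idea stated above.

```python
from typing import Callable, Iterable

# Hypothetical helper signatures; plug in your own inference/judging backends.
ModelFn = Callable[[str], str]        # query -> model response
JudgeFn = Callable[[str, str], bool]  # (query, response) -> True if hallucinated


def adversarially_filter(queries: Iterable[str],
                         alpaca: ModelFn,
                         judge: JudgeFn) -> list[str]:
    """Keep only the 'challenging' queries: those on which the weaker
    Alpaca model produces a hallucinated answer (as decided by the judge)."""
    hard_queries = []
    for q in queries:
        response = alpaca(q)
        if judge(q, response):  # Alpaca hallucinated -> query is hard, keep it
            hard_queries.append(q)
    return hard_queries


def hallucination_rate(queries: Iterable[str],
                       model_under_test: ModelFn,
                       judge: JudgeFn) -> float:
    """Fraction of benchmark queries on which the evaluated model hallucinates."""
    queries = list(queries)
    flagged = sum(judge(q, model_under_test(q)) for q in queries)
    return flagged / max(len(queries), 1)
```

In the paper, the judging step is grounded in reference answers synthesized with GPT-4 and retrieval-augmented generation; in this sketch that logic is hidden behind the `judge` callable, and the per-category breakdown over the five query types would simply apply `hallucination_rate` to each category's subset of queries.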
Authors: Zhiying Zhu, Zhiqing Sun, Yiming Yang