HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild (2403.04307v3)

Published 7 Mar 2024 in cs.CL

Abstract: Hallucinations pose a significant challenge to the reliability of LLMs in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging user queries (adversarially filtered by Alpaca) from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, which enables a fine-grained analysis of the kinds of hallucinations LLMs exhibit, and we synthesize reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach toward understanding and improving LLM reliability in scenarios reflective of real-world interactions. Our benchmark is available at https://github.com/HaluEval-Wild/HaluEval-Wild.
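The abstract outlines a simple pipeline: filter ShareGPT queries down to those a weaker model (Alpaca) fails on, group the survivors into five query types, synthesize reference answers with GPT-4 plus RAG, and then measure per-model hallucination rates against those references. The Python sketch below illustrates that flow under loose assumptions; generate, answer_is_bad, and hallucinated are hypothetical placeholders for whatever LLM API and judge the reader has available, not the authors' released benchmark code.

```python
# Minimal sketch of a HaluEval-Wild-style evaluation loop, assuming
# hypothetical generate/judge helpers (not the paper's released code).
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Query:
    text: str        # user query harvested from ShareGPT
    category: str    # one of the five query types identified in the paper
    reference: str   # reference answer synthesized with GPT-4 + RAG


def generate(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM inference API is available."""
    raise NotImplementedError


def answer_is_bad(query: str, answer: str) -> bool:
    """Hypothetical judge (e.g. a GPT-4 grader) flagging failed answers."""
    raise NotImplementedError


def hallucinated(answer: str, reference: str) -> bool:
    """Hypothetical check of an answer against the synthesized reference;
    True if the answer makes unsupported or contradictory claims."""
    raise NotImplementedError


def adversarial_filter(raw_queries: list[str]) -> list[str]:
    """Keep only the 'challenging' queries: those the weak filter model
    (Alpaca in the paper) fails to answer well."""
    return [q for q in raw_queries if answer_is_bad(q, generate("alpaca-7b", q))]


def hallucination_rate(model: str, benchmark: list[Query]) -> dict[str, float]:
    """Per-category hallucination rate of `model` on the filtered benchmark."""
    totals, errors = defaultdict(int), defaultdict(int)
    for q in benchmark:
        totals[q.category] += 1
        if hallucinated(generate(model, q.text), q.reference):
            errors[q.category] += 1
    return {cat: errors[cat] / totals[cat] for cat in totals}
```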

Authors (3)
  1. Zhiying Zhu (9 papers)
  2. Zhiqing Sun (35 papers)
  3. Yiming Yang (151 papers)
Citations (9)