HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs (2402.16211v1)
Abstract: Hallucinations pose a significant challenge to the reliability and alignment of LLMs, limiting their acceptance beyond chatbot applications. Despite ongoing mitigation efforts, they remain prevalent, and detecting them is itself a formidable task that frequently requires manual labeling or constrained evaluations. This paper introduces an automated, scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection: LLMs generate challenging questions about hypothetical phenomena and then serve as evaluator agents that detect hallucinations in the answers. The framework is domain-agnostic, allowing any LLM to be used for benchmark creation or evaluation in any domain. We introduce the publicly available HypoTermQA Benchmarking Dataset, on which state-of-the-art models scored between 3% and 11%, while evaluator agents showed a 6% error rate in hallucination prediction. The proposed framework offers a way to test and improve LLMs, and it can generate benchmarking datasets tailored to specific domains such as law, health, and finance.
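To make the pipeline concrete, here is a minimal sketch of the three-stage loop the abstract describes: invent a hypothetical (non-existent) term, ask the model under test about it, and have a second LLM act as an evaluator agent that labels the answer. The term, prompts, model names, and helper function are illustrative assumptions using the OpenAI chat completions API, not the paper's actual code or prompt templates.

```python
"""Hedged sketch of a HypoTermQA-style benchmark/evaluation loop."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to a chat model and return its reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# (1) A fabricated term. In the paper such terms are LLM-generated;
#     here one is hard-coded for illustration.
hypothetical_term = "quantum mnemonic lattice compression"
question = f"Explain how {hypothetical_term} is used in modern databases."

# (2) Query the model under test.
answer = ask("gpt-3.5-turbo", question)

# (3) Evaluator agent: a valid answer should flag the term as unknown or
#     non-existent; describing it as real counts as a hallucination.
judge_prompt = (
    f"The term '{hypothetical_term}' is fictional and does not exist.\n"
    f"Question: {question}\n"
    f"Answer: {answer}\n"
    "Does the answer treat the fictional term as if it were real? "
    "Reply with exactly HALLUCINATED or VALID."
)
verdict = ask("gpt-4", judge_prompt)
print(verdict)
```

Because both question generation and judging are delegated to LLMs, the same loop can be rerun with any generator, target, or evaluator model, which is what makes the framework domain-agnostic and scalable.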
Authors: Cem Uluoglakci, Tugba Taskaya Temizel