
The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models (2404.05904v2)

Published 8 Apr 2024 in cs.CL

Abstract: LLMs have transformed the NLP landscape with their remarkable ability to understand and generate human-like text. However, these models are prone to "hallucinations" -- outputs that do not align with factual reality or the input context. This paper introduces the Hallucinations Leaderboard, an open initiative to quantitatively measure and compare the tendency of each model to produce hallucinations. The leaderboard uses a comprehensive set of benchmarks focusing on different aspects of hallucinations, such as factuality and faithfulness, across various tasks, including question-answering, summarisation, and reading comprehension. Our analysis provides insights into the performance of different models, guiding researchers and practitioners in choosing the most reliable models for their applications.

The Hallucinations Leaderboard: Evaluating Hallucination Tendencies in LLMs

The proliferation of LLMs has fundamentally altered the NLP landscape through their capabilities for language generation and few-shot learning. However, these models are prone to generating outputs that do not align with factual reality or the provided context, a phenomenon termed "hallucinations." The paper "The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models" introduces a platform for evaluating the hallucination tendencies of LLMs. This essay outlines the paper's methodology, findings, and implications for future AI research.

Motivation and Framework

The motivation for developing the Hallucinations Leaderboard arises from the challenges posed by hallucinations in LLMs, which significantly limit their reliability across applications. The paper distinguishes two primary forms of hallucination: factuality hallucinations and faithfulness hallucinations. Factuality concerns whether the information an LLM produces is correct with respect to world knowledge, whereas faithfulness concerns whether the output adheres to the given source context. For instance, a summary that introduces a claim absent from the source document is unfaithful even if that claim happens to be factually correct.

To evaluate these two dimensions, the authors employ a diverse set of tasks grouped into factuality and faithfulness categories. The evaluation framework builds on the EleutherAI Language Model Evaluation Harness, which provides a structured approach to in-context learning in zero-shot and few-shot settings.
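
To give a concrete sense of how such an evaluation is driven, the following is a minimal sketch using the harness's Python API. The backend name, model, and task identifiers (e.g. truthfulqa_mc2, nq_open) are illustrative and depend on the harness version; they are not the leaderboard's exact configuration.

```python
# Minimal sketch: scoring one model on factuality-style tasks with the
# EleutherAI lm-evaluation-harness (pip install lm-eval). The model and
# task names are illustrative assumptions, not the paper's exact setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["truthfulqa_mc2", "nq_open"],           # example factuality tasks
    num_fewshot=0,                                 # zero-shot evaluation
    batch_size=8,
)

# Per-task metrics (accuracy, exact match, etc.) live under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```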

Analysis and Results

The paper evaluates 20 LLMs across 15 tasks spanning question answering, summarization, and reading comprehension, gauging both factuality and faithfulness. The results vary considerably across models and tasks. A key observation is that LLMs are better at judging whether a given text is factual or faithful than at generating responses that are themselves factually and contextually accurate.
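
To illustrate how per-task results might be rolled up into a leaderboard-style ranking along the two dimensions, here is a small sketch; the task groupings, model names, and scores are invented placeholders assuming normalised per-task scores, not the paper's data or its aggregation method.

```python
# Illustrative sketch: aggregating per-task scores into factuality and
# faithfulness averages and ranking models. All names and numbers are
# made-up placeholders, not the paper's results.
from statistics import mean

FACTUALITY_TASKS = {"truthfulqa", "nq_open", "triviaqa"}    # assumed grouping
FAITHFULNESS_TASKS = {"xsum_faith", "race", "squadv2"}      # assumed grouping

scores = {  # model -> task -> normalised score in [0, 1]
    "model_a": {"truthfulqa": 0.41, "nq_open": 0.28, "triviaqa": 0.65,
                "xsum_faith": 0.72, "race": 0.80, "squadv2": 0.55},
    "model_b": {"truthfulqa": 0.35, "nq_open": 0.31, "triviaqa": 0.58,
                "xsum_faith": 0.78, "race": 0.83, "squadv2": 0.61},
}

def group_average(task_scores, group):
    """Mean score over the tasks belonging to one hallucination dimension."""
    return mean(v for t, v in task_scores.items() if t in group)

leaderboard = sorted(
    ((name,
      group_average(s, FACTUALITY_TASKS),
      group_average(s, FAITHFULNESS_TASKS)) for name, s in scores.items()),
    key=lambda row: (row[1] + row[2]) / 2,  # rank by mean of the two averages
    reverse=True,
)

for name, fact, faith in leaderboard:
    print(f"{name}: factuality={fact:.2f}, faithfulness={faith:.2f}")
```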

The paper also explores the effects of model scale and fine-tuning. Larger models tend to perform better on factuality tasks, with scores increasing noticeably with model size. Instruction fine-tuning generally improves faithfulness, strengthening the models' adherence to the provided context, although this does not always translate into improved factuality.

Implications and Future Directions

The Hallucinations Leaderboard provides a practical platform for understanding and mitigating hallucination tendencies in LLMs. It helps researchers and practitioners select more reliable models for real-world applications, and the paper invites community contributions to the evolving leaderboard, suggesting a path for continuous improvement.

The paper opens several avenues for future research. One area of investigation could involve further exploring the trade-offs between instruction fine-tuning and factuality improvements. The influence of prompt templates and shot examples on hallucination tendencies also merits deeper exploration. Additionally, extending the scope of evaluated models to include proprietary black-box models like GPT-4 could provide a more robust comparison of LLM tendencies.

Conclusion

In conclusion, the Hallucinations Leaderboard represents a critical step toward addressing the hallucination challenges in LLMs. Through comprehensive evaluation and collective insights, it paves the way for enhanced LLM development and application, fostering advancements in NLP while highlighting the importance of community-driven efforts in AI research. As the landscape of AI continues to evolve, tools such as the Hallucinations Leaderboard will be invaluable in ensuring that LLMs are equipped to navigate complex, real-world scenarios with greater factual and contextual fidelity.

Authors (11)
  1. Giwon Hong (10 papers)
  2. Aryo Pradipta Gema (18 papers)
  3. Rohit Saxena (11 papers)
  4. Xiaotang Du (4 papers)
  5. Ping Nie (23 papers)
  6. Yu Zhao (207 papers)
  7. Laura Perez-Beltrachini (14 papers)
  8. Max Ryabinin (29 papers)
  9. Xuanli He (43 papers)
  10. Pasquale Minervini (88 papers)
  11. Clémentine Fourrier (9 papers)