
The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models (2404.05904v2)

Published 8 Apr 2024 in cs.CL

Abstract: LLMs have transformed the NLP landscape with their remarkable ability to understand and generate human-like text. However, these models are prone to "hallucinations" -- outputs that do not align with factual reality or the input context. This paper introduces the Hallucinations Leaderboard, an open initiative to quantitatively measure and compare the tendency of each model to produce hallucinations. The leaderboard uses a comprehensive set of benchmarks focusing on different aspects of hallucinations, such as factuality and faithfulness, across various tasks, including question-answering, summarisation, and reading comprehension. Our analysis provides insights into the performance of different models, guiding researchers and practitioners in choosing the most reliable models for their applications.

The Hallucinations Leaderboard: Evaluating Hallucination Tendencies in LLMs

The proliferation of LLMs has fundamentally altered the NLP landscape through their capabilities for language generation and few-shot learning. However, these models are prone to generating outputs that do not align with factual reality or the provided context, a phenomenon termed "hallucinations." The paper "The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models" introduces a platform for evaluating the hallucination tendencies of LLMs. This essay outlines the paper's methodology, findings, and implications for future AI research.

Motivation and Framework

The motivation for developing the Hallucinations Leaderboard arises from the challenges posed by hallucinations in LLMs, which significantly limit their reliability across applications. The paper distinguishes two primary forms of hallucination: factuality hallucinations and faithfulness hallucinations. Factuality concerns whether the information an LLM produces is correct with respect to world knowledge, whereas faithfulness concerns whether the output adheres to the given source context. For instance, a summary that introduces a claim absent from the source document is unfaithful even if that claim happens to be factually correct.

To evaluate these two dimensions, the authors employ a diverse set of tasks grouped into factuality and faithfulness categories. The evaluation framework builds on the EleutherAI Language Model Evaluation Harness, which provides a structured approach to in-context learning in zero-shot and few-shot settings.
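
To give a concrete sense of how such an evaluation is driven, the following is a minimal sketch using the harness's Python API. The backend name, model, and task identifiers (e.g. truthfulqa_mc2, nq_open) are illustrative and depend on the harness version; they are not the leaderboard's exact configuration.

```python
# Minimal sketch: scoring one model on factuality-style tasks with the
# EleutherAI lm-evaluation-harness (pip install lm-eval). The model and
# task names are illustrative assumptions, not the paper's exact setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["truthfulqa_mc2", "nq_open"],           # example factuality tasks
    num_fewshot=0,                                 # zero-shot evaluation
    batch_size=8,
)

# Per-task metrics (accuracy, exact match, etc.) live under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```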

Analysis and Results

The paper evaluates 20 LLMs across 15 tasks spanning question answering, summarization, and reading comprehension, gauging both factuality and faithfulness. The results vary considerably across models and tasks. A key observation is that LLMs are better at judging whether a given text is factual or faithful than at generating responses that are themselves factually and contextually accurate.
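
To illustrate how per-task results might be rolled up into a leaderboard-style ranking along the two dimensions, here is a small sketch; the task groupings, model names, and scores are invented placeholders assuming normalised per-task scores, not the paper's data or its aggregation method.

```python
# Illustrative sketch: aggregating per-task scores into factuality and
# faithfulness averages and ranking models. All names and numbers are
# made-up placeholders, not the paper's results.
from statistics import mean

FACTUALITY_TASKS = {"truthfulqa", "nq_open", "triviaqa"}    # assumed grouping
FAITHFULNESS_TASKS = {"xsum_faith", "race", "squadv2"}      # assumed grouping

scores = {  # model -> task -> normalised score in [0, 1]
    "model_a": {"truthfulqa": 0.41, "nq_open": 0.28, "triviaqa": 0.65,
                "xsum_faith": 0.72, "race": 0.80, "squadv2": 0.55},
    "model_b": {"truthfulqa": 0.35, "nq_open": 0.31, "triviaqa": 0.58,
                "xsum_faith": 0.78, "race": 0.83, "squadv2": 0.61},
}

def group_average(task_scores, group):
    """Mean score over the tasks belonging to one hallucination dimension."""
    return mean(v for t, v in task_scores.items() if t in group)

leaderboard = sorted(
    ((name,
      group_average(s, FACTUALITY_TASKS),
      group_average(s, FAITHFULNESS_TASKS)) for name, s in scores.items()),
    key=lambda row: (row[1] + row[2]) / 2,  # rank by mean of the two averages
    reverse=True,
)

for name, fact, faith in leaderboard:
    print(f"{name}: factuality={fact:.2f}, faithfulness={faith:.2f}")
```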

The paper also explores the effects of model scale and fine-tuning. Larger models tend to perform better on factuality tasks, with scores increasing noticeably with model size. Instruction fine-tuning generally improves faithfulness, strengthening the models' adherence to the provided context, although this does not always translate into improved factuality.

Implications and Future Directions

The Hallucinations Leaderboard provides a practical platform for understanding and mitigating hallucination tendencies in LLMs. It helps researchers and practitioners select more reliable models for real-world applications, and the paper invites community contributions to the evolving leaderboard, suggesting a path for continuous improvement.

The paper opens several avenues for future research. One area of investigation could involve further exploring the trade-offs between instruction fine-tuning and factuality improvements. The influence of prompt templates and shot examples on hallucination tendencies also merits deeper exploration. Additionally, extending the scope of evaluated models to include proprietary black-box models like GPT-4 could provide a more robust comparison of LLM tendencies.

Conclusion

In conclusion, the Hallucinations Leaderboard represents a critical step toward addressing the hallucination challenges in LLMs. Through comprehensive evaluation and collective insights, it paves the way for enhanced LLM development and application, fostering advancements in NLP while highlighting the importance of community-driven efforts in AI research. As the landscape of AI continues to evolve, tools such as the Hallucinations Leaderboard will be invaluable in ensuring that LLMs are equipped to navigate complex, real-world scenarios with greater factual and contextual fidelity.

Authors (11)
  1. Giwon Hong (10 papers)
  2. Aryo Pradipta Gema (18 papers)
  3. Rohit Saxena (11 papers)
  4. Xiaotang Du (4 papers)
  5. Ping Nie (23 papers)
  6. Yu Zhao (207 papers)
  7. Laura Perez-Beltrachini (14 papers)
  8. Max Ryabinin (29 papers)
  9. Xuanli He (43 papers)
  10. Pasquale Minervini (88 papers)
  11. Clémentine Fourrier (9 papers)