TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization (2402.13249v2)

Published 20 Feb 2024 in cs.CL and cs.AI

Abstract: Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

Evaluation of LLMs on Topic-Focused Dialogue Summarization: A Study on Hallucinations

Introduction to TofuEval

Research on LLMs for text summarization has grown rapidly, especially in news summarization, while dialogue summarization remains comparatively underexplored. This paper introduces TofuEval, a benchmark for assessing the factual consistency of LLM-generated summaries in the dialogue domain. TofuEval comprises topic-focused summaries produced by LLMs of varying sizes, together with sentence-level binary human annotations of factual consistency and written explanations for every sentence judged factually inconsistent.
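
To make the annotation scheme concrete, the Python sketch below shows one plausible way to represent a TofuEval-style example: a topic-focused summary whose sentences each carry a binary factual-consistency label, with a free-text explanation attached to inconsistent sentences. The class and field names are illustrative assumptions, not the paper's released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SentenceAnnotation:
    """One summary sentence with its binary factual-consistency label."""
    sentence: str
    is_factually_consistent: bool
    # Explanations are only written for sentences judged inconsistent.
    explanation: Optional[str] = None


@dataclass
class TopicSummaryExample:
    """A topic-focused summary of a dialogue, annotated at the sentence level."""
    dialogue_id: str   # source transcript (e.g., an interview or meeting)
    topic: str         # the topic the summary is supposed to cover
    summarizer: str    # which LLM produced the summary
    sentences: List[SentenceAnnotation] = field(default_factory=list)

    def consistency_rate(self) -> float:
        """Fraction of summary sentences judged factually consistent."""
        if not self.sentences:
            return 0.0
        return sum(s.is_factually_consistent for s in self.sentences) / len(self.sentences)


# Example usage with made-up data.
example = TopicSummaryExample(
    dialogue_id="meeting_001",
    topic="budget approval",
    summarizer="some-llm",
    sentences=[
        SentenceAnnotation("The council approved the budget.", True),
        SentenceAnnotation("The vote was unanimous.", False,
                           explanation="The transcript records two dissenting votes."),
    ],
)
print(f"{example.consistency_rate():.0%} of sentences are consistent")
```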

Hallucinations in LLM-generated Summaries

The paper's findings highlight a pervasive problem of hallucinations, i.e., factually inconsistent content introduced by the model, in dialogue summarization. LLMs of all sizes introduce substantial numbers of factual errors into their summaries, contradicting the common assumption that larger models inherently produce more factually consistent output. Furthermore, when LLMs, including GPT-4, are used as binary factual-consistency evaluators, they perform poorly: specialized state-of-the-art factuality metrics outperform LLM-based evaluators in both accuracy and computational cost.
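
As a rough illustration of how a binary factual-consistency evaluator might be scored against the benchmark's human labels, the sketch below wraps an arbitrary judge (an LLM prompt or a thresholded specialized metric) in a simple evaluation loop and reports balanced accuracy. The function names, the scoring choice, and the toy data are assumptions for illustration, not the paper's exact protocol.

```python
from typing import Callable, List, Tuple


def balanced_accuracy(preds: List[bool], labels: List[bool]) -> float:
    """Mean of recall on the consistent and inconsistent classes,
    a common choice when inconsistent sentences are the minority."""
    tp = sum(p and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    pos = sum(labels) or 1
    neg = (len(labels) - sum(labels)) or 1
    return 0.5 * (tp / pos + tn / neg)


def evaluate_binary_judge(
    judge: Callable[[str, str], bool],
    data: List[Tuple[str, str, bool]],  # (source dialogue, summary sentence, human label)
) -> float:
    """Run a factuality judge over annotated sentences and score it against human labels."""
    preds = [judge(dialogue, sentence) for dialogue, sentence, _ in data]
    labels = [label for _, _, label in data]
    return balanced_accuracy(preds, labels)


# A trivial stand-in judge: a real run would wrap an LLM prompt
# ("Is this sentence supported by the dialogue? Answer yes or no.")
# or a specialized metric thresholded to a binary decision.
def always_consistent(dialogue: str, sentence: str) -> bool:
    return True


data = [
    ("A: The meeting starts at 9.", "The meeting starts at 9.", True),
    ("A: The meeting starts at 9.", "The meeting starts at 10.", False),
]
print(evaluate_binary_judge(always_consistent, data))  # 0.5: no better than chance
```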

Comparative Analysis with Existing Benchmarks

Whereas most existing benchmarks focus on news summarization, TofuEval targets dialogue summarization across varied settings, including interviews and meetings. This focus reflects the practical value of dialogue summarization in real-world scenarios such as condensing customer-service interactions and making meetings more efficient. The paper situates TofuEval within the landscape of existing benchmarks and highlights its distinctive contributions, notably expert-annotated factual-consistency labels accompanied by written explanations, which together provide a comprehensive framework for assessing summary factuality in dialogue contexts.

Error Taxonomy and Analysis

An innovative aspect of the paper is the development and application of a detailed error taxonomy tailored for the dialogue summarization domain. This taxonomy facilitates a nuanced analysis of the types and distributions of errors across model-generated summaries, revealing diverse patterns of factual inaccuracies. Through this taxonomy, the paper identifies specific areas where non-LLM based metrics excel in capturing error types more effectively than their LLM-based counterparts, thus providing valuable insights for future improvements in model evaluators and summarization techniques.
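
The sketch below illustrates the kind of analysis such a taxonomy enables: tallying error types per summarization model and measuring, for each error type, how often an automatic evaluator flags the affected sentences. The error-type names and data here are placeholders, not the paper's actual taxonomy or results.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def error_distribution(annotations: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count error types per summarization model.

    `annotations` holds (model_name, error_type) pairs, one per
    factually inconsistent sentence.
    """
    per_model: Dict[str, Counter] = defaultdict(Counter)
    for model, error_type in annotations:
        per_model[model][error_type] += 1
    return per_model


def recall_by_error_type(detected: List[bool], error_types: List[str]) -> Dict[str, float]:
    """For each error type, the fraction of erroneous sentences that an
    automatic evaluator flagged as inconsistent."""
    hits, totals = Counter(), Counter()
    for flagged, etype in zip(detected, error_types):
        totals[etype] += 1
        hits[etype] += int(flagged)
    return {etype: hits[etype] / totals[etype] for etype in totals}


# Toy usage with hypothetical error categories.
annotations = [("model_a", "extrinsic"), ("model_a", "reasoning"), ("model_b", "extrinsic")]
print(error_distribution(annotations))
print(recall_by_error_type([True, False, True], ["extrinsic", "reasoning", "extrinsic"]))
```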

Implications and Future Directions

The implications of this research are twofold. Practically, it underscores the need to improve the factual consistency of LLM-generated dialogue summaries, calling for models and evaluation metrics that better handle the complexities of dialogic text. Theoretically, it sharpens our understanding of the limitations and capabilities of LLMs across different summarization settings, challenging the notion of one-size-fits-all model proficiency. Looking ahead, the paper encourages the research community to develop specialized models and metrics tailored to dialogue summarization, which could extend AI's applicability to more nuanced, context-rich summarization tasks.

Conclusion

TofuEval is an important step in understanding LLM performance on dialogue summarization, highlighting the prevalence of hallucinations in model-generated summaries and the current inadequacy of LLMs as reliable evaluators of factual consistency. By providing a robust benchmark and a detailed analysis of the errors present in summaries, the paper lays the groundwork for more accurate, efficient, and contextually aware summarization and evaluation models.

Authors (14)
  1. Liyan Tang (12 papers)
  2. Igor Shalyminov (20 papers)
  3. Amy Wing-mei Wong (2 papers)
  4. Jon Burnsky (3 papers)
  5. Jake W. Vincent (2 papers)
  6. Yu'an Yang (1 paper)
  7. Siffi Singh (7 papers)
  8. Song Feng (43 papers)
  9. Hwanjun Song (44 papers)
  10. Hang Su (224 papers)
  11. Lijia Sun (4 papers)
  12. Yi Zhang (994 papers)
  13. Saab Mansour (32 papers)
  14. Kathleen McKeown (85 papers)
Citations (30)