TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization (2402.13249v2)

Published 20 Feb 2024 in cs.CL and cs.AI

Abstract: Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

Evaluation of LLMs on Topic-Focused Dialogue Summarization: A Study on Hallucinations

Introduction to TofuEval

Research on LLMs for text summarization has grown rapidly, especially in news summarization, while dialogue summarization remains comparatively underexplored. This paper introduces TofuEval, a benchmark for assessing the factual consistency of LLM-generated summaries in the dialogue domain. TofuEval comprises topic-focused summaries produced by LLMs of varying sizes, together with sentence-level binary human annotations of factual consistency and written explanations for every sentence judged factually inconsistent.
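
To make the annotation scheme concrete, the Python sketch below shows one plausible way to represent a TofuEval-style example: a topic-focused summary whose sentences each carry a binary factual-consistency label, with a free-text explanation attached to inconsistent sentences. The class and field names are illustrative assumptions, not the paper's released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SentenceAnnotation:
    """One summary sentence with its binary factual-consistency label."""
    sentence: str
    is_factually_consistent: bool
    # Explanations are only written for sentences judged inconsistent.
    explanation: Optional[str] = None


@dataclass
class TopicSummaryExample:
    """A topic-focused summary of a dialogue, annotated at the sentence level."""
    dialogue_id: str   # source transcript (e.g., an interview or meeting)
    topic: str         # the topic the summary is supposed to cover
    summarizer: str    # which LLM produced the summary
    sentences: List[SentenceAnnotation] = field(default_factory=list)

    def consistency_rate(self) -> float:
        """Fraction of summary sentences judged factually consistent."""
        if not self.sentences:
            return 0.0
        return sum(s.is_factually_consistent for s in self.sentences) / len(self.sentences)


# Example usage with made-up data.
example = TopicSummaryExample(
    dialogue_id="meeting_001",
    topic="budget approval",
    summarizer="some-llm",
    sentences=[
        SentenceAnnotation("The council approved the budget.", True),
        SentenceAnnotation("The vote was unanimous.", False,
                           explanation="The transcript records two dissenting votes."),
    ],
)
print(f"{example.consistency_rate():.0%} of sentences are consistent")
```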

Hallucinations in LLM-generated Summaries

The paper's findings highlight a pervasive problem of hallucinations, i.e., factually inconsistent content introduced by the model, in dialogue summarization. LLMs of all sizes introduce substantial numbers of factual errors into their summaries, contradicting the common assumption that larger models inherently produce more factually consistent output. Furthermore, when LLMs, including GPT-4, are used as binary factual-consistency evaluators, they perform poorly: specialized state-of-the-art factuality metrics outperform LLM-based evaluators in both accuracy and computational cost.
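
As a rough illustration of how a binary factual-consistency evaluator might be scored against the benchmark's human labels, the sketch below wraps an arbitrary judge (an LLM prompt or a thresholded specialized metric) in a simple evaluation loop and reports balanced accuracy. The function names, the scoring choice, and the toy data are assumptions for illustration, not the paper's exact protocol.

```python
from typing import Callable, List, Tuple


def balanced_accuracy(preds: List[bool], labels: List[bool]) -> float:
    """Mean of recall on the consistent and inconsistent classes,
    a common choice when inconsistent sentences are the minority."""
    tp = sum(p and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    pos = sum(labels) or 1
    neg = (len(labels) - sum(labels)) or 1
    return 0.5 * (tp / pos + tn / neg)


def evaluate_binary_judge(
    judge: Callable[[str, str], bool],
    data: List[Tuple[str, str, bool]],  # (source dialogue, summary sentence, human label)
) -> float:
    """Run a factuality judge over annotated sentences and score it against human labels."""
    preds = [judge(dialogue, sentence) for dialogue, sentence, _ in data]
    labels = [label for _, _, label in data]
    return balanced_accuracy(preds, labels)


# A trivial stand-in judge: a real run would wrap an LLM prompt
# ("Is this sentence supported by the dialogue? Answer yes or no.")
# or a specialized metric thresholded to a binary decision.
def always_consistent(dialogue: str, sentence: str) -> bool:
    return True


data = [
    ("A: The meeting starts at 9.", "The meeting starts at 9.", True),
    ("A: The meeting starts at 9.", "The meeting starts at 10.", False),
]
print(evaluate_binary_judge(always_consistent, data))  # 0.5: no better than chance
```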

Comparative Analysis with Existing Benchmarks

Whereas most existing benchmarks focus on news summarization, TofuEval targets dialogue summarization across varied settings, including interviews and meetings. This focus reflects the practical value of dialogue summarization in real-world scenarios such as condensing customer-service interactions and making meetings more efficient. The paper situates TofuEval within the landscape of existing benchmarks and highlights its distinctive contributions, notably expert-annotated factual-consistency labels accompanied by written explanations, which together provide a comprehensive framework for assessing summary factuality in dialogue contexts.

Error Taxonomy and Analysis

An innovative aspect of the paper is the development and application of a detailed error taxonomy tailored for the dialogue summarization domain. This taxonomy facilitates a nuanced analysis of the types and distributions of errors across model-generated summaries, revealing diverse patterns of factual inaccuracies. Through this taxonomy, the paper identifies specific areas where non-LLM based metrics excel in capturing error types more effectively than their LLM-based counterparts, thus providing valuable insights for future improvements in model evaluators and summarization techniques.
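
The sketch below illustrates the kind of analysis such a taxonomy enables: tallying error types per summarization model and measuring, for each error type, how often an automatic evaluator flags the affected sentences. The error-type names and data here are placeholders, not the paper's actual taxonomy or results.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def error_distribution(annotations: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count error types per summarization model.

    `annotations` holds (model_name, error_type) pairs, one per
    factually inconsistent sentence.
    """
    per_model: Dict[str, Counter] = defaultdict(Counter)
    for model, error_type in annotations:
        per_model[model][error_type] += 1
    return per_model


def recall_by_error_type(detected: List[bool], error_types: List[str]) -> Dict[str, float]:
    """For each error type, the fraction of erroneous sentences that an
    automatic evaluator flagged as inconsistent."""
    hits, totals = Counter(), Counter()
    for flagged, etype in zip(detected, error_types):
        totals[etype] += 1
        hits[etype] += int(flagged)
    return {etype: hits[etype] / totals[etype] for etype in totals}


# Toy usage with hypothetical error categories.
annotations = [("model_a", "extrinsic"), ("model_a", "reasoning"), ("model_b", "extrinsic")]
print(error_distribution(annotations))
print(recall_by_error_type([True, False, True], ["extrinsic", "reasoning", "extrinsic"]))
```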

Implications and Future Directions

The implications of this research are twofold. Practically, it underscores the need to improve the factual consistency of LLM-generated dialogue summaries, calling for models and evaluation metrics that better handle the complexities of dialogic text. Theoretically, it sharpens our understanding of the limitations and capabilities of LLMs across different summarization settings, challenging the notion of one-size-fits-all model proficiency. Looking ahead, the paper encourages the research community to develop specialized models and metrics tailored to dialogue summarization, which could extend AI's applicability to more nuanced, context-rich summarization tasks.

Conclusion

TofuEval is an important step in understanding LLM performance on dialogue summarization, highlighting the prevalence of hallucinations in model-generated summaries and the current inadequacy of LLMs as reliable evaluators of factual consistency. By providing a robust benchmark and a detailed analysis of the errors present in summaries, the paper lays the groundwork for more accurate, efficient, and contextually aware summarization and evaluation models.

Authors (14)
  1. Liyan Tang (12 papers)
  2. Igor Shalyminov (20 papers)
  3. Amy Wing-mei Wong (2 papers)
  4. Jon Burnsky (3 papers)
  5. Jake W. Vincent (2 papers)
  6. Yu'an Yang (1 paper)
  7. Siffi Singh (7 papers)
  8. Song Feng (43 papers)
  9. Hwanjun Song (44 papers)
  10. Hang Su (224 papers)
  11. Lijia Sun (4 papers)
  12. Yi Zhang (994 papers)
  13. Saab Mansour (32 papers)
  14. Kathleen McKeown (85 papers)
Citations (30)