Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains (2402.03509v1)
Abstract: Recent work has shown that LLMs can generate summaries zero-shot (i.e., without explicit supervision) that, under human assessment, are often comparable to, or even preferred over, manually composed reference summaries. However, this prior work has focused almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (potentially more specialized) domains? In this work we evaluate zero-shot generated summaries across specialized domains, including biomedical articles and legal bills (in addition to standard news benchmarks for reference). We focus especially on the factuality of outputs. We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors. We analyze whether the prevalence of a given domain in the pretraining corpus affects the extractiveness and faithfulness of generated summaries of articles in that domain. We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization beyond news articles. The dataset can be downloaded from https://github.com/sanjanaramprasad/zero_shot_faceval_domains
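The abstract refers to the extractiveness of generated summaries. A standard way to quantify extractiveness is via extractive fragment coverage and density (Grusky et al., 2018); the sketch below illustrates those two metrics under simplifying assumptions (greedy fragment matching, whitespace tokenization) and is not the paper's exact implementation.

```python
# Minimal sketch of extractive fragment coverage and density
# (Grusky et al., 2018). Greedy matching and whitespace tokenization
# are simplifying assumptions, not the paper's exact implementation.

def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find maximal summary spans copied verbatim from the article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best_len = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best_len = max(best_len, k)
        if best_len > 0:
            fragments.append(summary_tokens[i:i + best_len])
            i += best_len
        else:
            i += 1
    return fragments

def coverage_and_density(article, summary):
    """Coverage: fraction of summary tokens inside copied fragments.
    Density: average squared fragment length (longer verbatim copies -> higher)."""
    a, s = article.split(), summary.split()
    if not s:
        return 0.0, 0.0
    frags = extractive_fragments(a, s)
    coverage = sum(len(f) for f in frags) / len(s)
    density = sum(len(f) ** 2 for f in frags) / len(s)
    return coverage, density

if __name__ == "__main__":
    doc = "the bill amends the tax code to raise the standard deduction"
    summ = "the bill amends the tax code"
    print(coverage_and_density(doc, summ))  # fully copied summary -> coverage 1.0
```

Higher coverage and density indicate that a summary copies more material verbatim from the source, which is one lens the paper uses alongside expert faithfulness annotations.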