DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely (2212.10013v2)

Published 20 Dec 2022 in cs.AI and cs.CL

Abstract: Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate because of the additional information a human-written reference provides, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies some reference-based metrics use to evaluate a system summary against its corresponding reference can be effectively adapted to assess the summary against its source document, thereby turning these metrics into reference-free ones. Experimental results support this hypothesis. Once repurposed reference-freely, zero-shot BERTScore with the pretrained DeBERTa-large-MNLI model (fewer than 0.5B parameters) consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also outperforms most existing reference-free metrics and competes closely with zero-shot summary evaluators based on GPT-3.5.
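
The core recipe is simple: keep a reference-based metric's comparison machinery, but feed it the source document in place of the human-written reference. Below is a minimal sketch of that idea using the open-source `bert_score` package with the DeBERTa-large-MNLI backbone the abstract highlights; the example texts and variable names are illustrative assumptions, not data from the paper.

```python
# Minimal sketch: repurposing reference-based BERTScore as a reference-free
# summary quality metric, per the DocAsRef idea of letting the source
# document stand in as the "reference". Assumes `pip install bert-score`;
# the texts below are made-up examples, not from the paper.
from bert_score import score

document = (
    "The city council approved a new transit plan on Monday, adding two "
    "bus lines and extending weekend subway hours for late-shift workers."
)
system_summary = (
    "Council approves transit plan with new bus lines and longer weekend subway hours."
)

# Reference-free use: the source document fills the slot a human-written
# reference normally would. DeBERTa-large-MNLI (<0.5B parameters) is the
# zero-shot backbone the paper reports performing best.
P, R, F1 = score(
    cands=[system_summary],
    refs=[document],
    model_type="microsoft/deberta-large-mnli",
)
print(f"Precision={P.item():.4f}  Recall={R.item():.4f}  F1={F1.item():.4f}")
```

Because nothing changes in the scoring call except what fills the reference slot, other reference-based metrics with the same candidate/reference interface (e.g., MoverScore or BLEURT) can be repurposed the same way.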

References (20)
  1. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  2. Forrest Bao, Ge Luo, Hebi Li, Minghui Qiu, Yinfei Yang, Youbiao He, and Cen Chen. 2022. SueNes: A weakly supervised approach to evaluating single-document summarization via negative sampling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2450–2458, Seattle, United States. Association for Computational Linguistics.
  3. Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  4. Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. 2023. Human-like summarization evaluation with ChatGPT. arXiv preprint.
  5. Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347–1354, Online. Association for Computational Linguistics.
  6. Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Association for Computational Linguistics.
  7. Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177.
  8. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  9. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint.
  10. Yizhu Liu, Qi Jia, and Kenny Zhu. 2022. Reference-free summarization evaluation via semantic correlation and compression ratio. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2109–2115, Seattle, United States. Association for Computational Linguistics.
  11. Ananya Mukherjee and Manish Shrivastava. 2022. REUSE: REference-free UnSupervised quality estimation metric. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 564–568, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  12. Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. 2007. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing (TSLP), 4(2):4–es.
  13. NIST. 2010. TAC2010 guided summarization competition. https://tac.nist.gov/2010/Summarization/Guided-Summ.2010.guidelines.html. Accessed: 2021-08-16.
  14. Maxime Peyrard, Teresa Botschen, and Iryna Gurevych. 2017. Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 74–84, Copenhagen, Denmark. Association for Computational Linguistics.
  15. Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.
  16. Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
  17. Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20, Online. Association for Computational Linguistics.
  18. Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint.
  19. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  20. Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
Authors (8)
  1. Forrest Sheng Bao (16 papers)
  2. Ruixuan Tu (3 papers)
  3. Ge Luo (8 papers)
  4. Yinfei Yang (73 papers)
  5. Hebi Li (5 papers)
  6. Minghui Qiu (58 papers)
  7. Youbiao He (7 papers)
  8. Cen Chen (81 papers)