Just ClozE! A Novel Framework for Evaluating the Factual Consistency Faster in Abstractive Summarization (2210.02804v2)

Published 6 Oct 2022 in cs.CL and cs.AI

Abstract: The issue of factual consistency in abstractive summarization has received extensive attention in recent years, and evaluating the factual consistency between a summary and its source document has become an important and urgent task. Most current evaluation metrics are adapted from the question answering (QA) or natural language inference (NLI) task. However, QA-based metrics are extremely time-consuming to apply in practice, while NLI-based metrics lack interpretability. In this paper, we propose a cloze-based evaluation framework called ClozE and show the great potential of cloze-based metrics. ClozE inherits the strong interpretability of QA while maintaining NLI-level inference speed. Through experiments on six human-annotated datasets and the meta-evaluation benchmark GO FIGURE (Gabriel et al., 2021), we demonstrate that ClozE reduces evaluation time by nearly 96% relative to QA-based metrics while retaining their interpretability and performance. Finally, we discuss three important facets of ClozE in practice, which further demonstrate its better overall performance compared to other metrics.
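The abstract sketches the core idea of a cloze-based metric: mask candidate factual spans in the summary, let a masked language model fill the blanks while reading the source document, and score how many blanks are restored correctly. The snippet below is a minimal sketch of that idea and not the authors' implementation: the bert-base-uncased backbone, the crude heuristic for picking "factual factors" (capitalized or numeric tokens, rather than proper entity/noun-phrase extraction), single-token masking, and exact-match scoring are all illustrative assumptions.

```python
# Minimal sketch of a cloze-style factual-consistency check in the spirit of ClozE.
# Illustrative only: backbone, span selection, and scoring are simplifying assumptions.
from transformers import pipeline

# Masked-LM backbone (assumption: bert-base-uncased; the paper's choice may differ).
fill = pipeline("fill-mask", model="bert-base-uncased")

def candidate_factors(summary: str) -> list[str]:
    """Crude stand-in for 'factual factors': capitalized or numeric tokens."""
    tokens = summary.replace(".", " ").replace(",", " ").split()
    return [t for t in tokens if t[:1].isupper() or t.isdigit()]

def cloze_consistency(document: str, summary: str) -> float:
    factors = candidate_factors(summary)
    if not factors:
        return 1.0  # nothing to verify
    hits = 0
    for factor in factors:
        # Mask one factual factor in the summary.
        masked = summary.replace(factor, fill.tokenizer.mask_token, 1)
        # Condition the cloze on the source document via a sentence-pair style input.
        # (No truncation handling here; long documents would need chunking.)
        text = f"{document} {fill.tokenizer.sep_token} {masked}"
        best = fill(text, top_k=1)[0]
        if best["token_str"].strip().lower() == factor.lower():
            hits += 1
    # Fraction of masked factors the model restores from the document.
    return hits / len(factors)

if __name__ == "__main__":
    src = "Apple said on Tuesday it will open a new campus in Austin."
    print(cloze_consistency(src, "Apple will open a campus in Austin."))   # expected higher
    print(cloze_consistency(src, "Google will open a campus in Austin."))  # expected lower
```

In this framing, a faithful summary should recover most of its masked factors from the document while a hallucinated one should not, which is what gives a cloze score QA-like interpretability without the cost of generating and answering questions.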

References (39)
  1. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72.
  2. Adversarial NLI for factual correctness in text summarisation models. arXiv preprint arXiv:2005.11739.
  3. Factual error correction for abstractive summarization models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6251–6258.
  4. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.
  5. Understanding the extent to which content quality metrics measure the information quality of summaries. In Proceedings of the 25th Conference on Computational Natural Language Learning, pp. 300–309.
  6. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335.
  7. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5055–5070.
  8. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9, 391–409.
  9. QAFactEval: Improved QA-based factual consistency evaluation for summarization. arXiv preprint arXiv:2112.08542.
  10. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2214–2220.
  11. GO FIGURE: A meta evaluation of factuality in summarization. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 478–487.
  12. Annotating and modeling fine-grained factuality in summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1449–1462.
  13. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28.
  14. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186.
  15. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  16. FFCI: A framework for interpretable automatic evaluation of summarization. Journal of Artificial Intelligence Research, 73, 1553–1607.
  17. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 540–551.
  18. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9332–9346.
  19. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10, 163–177.
  20. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880.
  21. Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81.
  22. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
  23. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  24. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1906–1919.
  25. Improving factual consistency of abstractive summarization via question answering. arXiv preprint arXiv:2105.04623.
  26. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1797–1807. Association for Computational Linguistics.
  27. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4812–4829.
  28. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
  29. Evaluating machine common sense via cloze testing. arXiv preprint arXiv:2201.07902.
  30. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702.
  31. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6594–6604. Association for Computational Linguistics.
  32. Answers Unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3237–3247. Association for Computational Linguistics.
  33. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881–7892.
  34. Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4), 415–433.
  35. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pp. 11–20.
  36. Attention is all you need. Advances in Neural Information Processing Systems, 30.
  37. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5008–5020.
  38. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  39. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Authors (7)
  1. Yiyang Li (30 papers)
  2. Lei Li (1293 papers)
  3. Marina Litvak (5 papers)
  4. Natalia Vanetik (9 papers)
  5. Dingxin Hu (2 papers)
  6. Yuze Li (5 papers)
  7. Yanquan Zhou (3 papers)
