
Improving Faithfulness of Abstractive Summarization by Controlling Confounding Effect of Irrelevant Sentences (2212.09726v2)

Published 19 Dec 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Lack of factual correctness is an issue that still plagues state-of-the-art summarization systems despite their impressive progress on generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. To that end, we leverage information-theoretic measures of causal effects to quantify the amount of confounding and precisely characterize how it affects summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large, thereby necessitating the use of human-annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20% over strong baselines on AnswerSumm (Fabbri et al., 2021), a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also achieving state-of-the-art results on standard metrics like ROUGE and METEOR. We corroborate these improvements through human evaluation.
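The multi-task design sketched in the abstract is easy to picture even without the paper's details. Below is a minimal sketch, not the authors' implementation: a shared encoder feeds both a summary decoder and an auxiliary relevance head supervised by the human-annotated relevant sentences, so the encoder is pushed to separate relevant content from the irrelevant, confounding parts of the input. The module sizes, the per-token relevance labels (a stand-in for the paper's sentence-level annotations), and the loss weight alpha are all illustrative assumptions.

```python
# Minimal sketch of a multi-task summarizer with an auxiliary relevance
# head (an illustration of the idea in the abstract, not the authors' code).
import torch
import torch.nn as nn

class MultiTaskSummarizer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)  # summary token logits
        self.relevance_head = nn.Linear(d_model, 1)    # per-token relevance logit

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.embed(src_ids))
        # causal mask so the decoder cannot peek at future summary tokens
        n = tgt_ids.size(1)
        causal = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
        dec = self.decoder(self.embed(tgt_ids), memory, tgt_mask=causal)
        return self.lm_head(dec), self.relevance_head(memory).squeeze(-1)

model = MultiTaskSummarizer()
src = torch.randint(0, 32000, (2, 64))      # source document tokens
tgt = torch.randint(0, 32000, (2, 16))      # reference summary tokens
rel = torch.randint(0, 2, (2, 64)).float()  # 1 = token lies in a relevant sentence

sum_logits, rel_logits = model(src, tgt)
# standard teacher-forced generation loss on the summary
loss_sum = nn.functional.cross_entropy(
    sum_logits[:, :-1].reshape(-1, 32000), tgt[:, 1:].reshape(-1))
# auxiliary loss: predict which input tokens belong to relevant sentences
loss_rel = nn.functional.binary_cross_entropy_with_logits(rel_logits, rel)
alpha = 0.5  # task weight: an illustrative assumption
(loss_sum + alpha * loss_rel).backward()
```

In the paper, the auxiliary supervision comes from AnswerSumm's annotated relevant sentences, and the theoretical analysis characterizes when that extra signal is necessary: data distributions where the confounding effect of irrelevant sentences is large.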

References (28)
  1. Nihat Ay and Daniel Polani. 2008. Information flows in causal networks. Advances in Complex Systems, 11(01):17–41.
  2. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  3. Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. CoRR, abs/2006.14799.
  4. Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686, Melbourne, Australia. Association for Computational Linguistics.
  5. Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2020. GSum: A general framework for guided neural abstractive summarization. arXiv preprint arXiv:2010.08014.
  6. Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070.
  7. Alexander R. Fabbri, Xiaojian Wu, Srini Iyer, Haoran Li, and Mona Diab. 2021. AnswerSumm: A manually-curated dataset and pipeline for answer summarization.
  8. Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.
  9. Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao. 2021. GO FIGURE: A meta evaluation of factuality in summarization. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 478–487, Online. Association for Computational Linguistics.
  10. Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana. Association for Computational Linguistics.
  11. Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.
  12. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
  13. Haoran Li, Arash Einolghozati, Srinivasan Iyer, Bhargavi Paranjape, Yashar Mehdad, Sonal Gupta, and Marjan Ghazvininejad. 2021. EASE: Extractive-abstractive summarization end-to-end using the information bottleneck principle. In Proceedings of the Third Workshop on New Frontiers in Summarization, pages 85–95, Online and in the Dominican Republic. Association for Computational Linguistics.
  14. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  15. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  16. Ziming Mao, Chen Henry Wu, Ansong Ni, Yusen Zhang, Rui Zhang, Tao Yu, Budhaditya Deb, Chenguang Zhu, Ahmed H. Awadallah, and Dragomir Radev. 2021. DYLE: Dynamic latent extraction for abstractive long-input summarization. arXiv preprint arXiv:2110.08168.
  17. Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.
  18. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
  19. Judea Pearl. 2012. The causal foundations of structural equation modeling. Technical report, California Univ Los Angeles Dept of Computer Science.
  20. Maxime Peyrard. 2019. A simple theoretical model of importance for summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1059–1073.
  21. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  22. ConFiT: Toward faithful dialogue summarization with linguistically-informed contrastive fine-tuning. arXiv preprint arXiv:2112.08713.
  23. Yuexiang Xie, Fei Sun, Yang Deng, Yaliang Li, and Bolin Ding. 2021. Factual consistency evaluation for text summarization via counterfactual estimation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 100–110.
  24. Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems.
  25. Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2020. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations.
  26. Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
  27. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  28. Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. 2021. QMSum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938.
Authors (9)
  1. Asish Ghoshal
  2. Arash Einolghozati
  3. Ankit Arun
  4. Haoran Li
  5. Lili Yu
  6. Vera Gor
  7. Yashar Mehdad
  8. Scott Wen-tau Yih
  9. Asli Celikyilmaz
Citations (1)