What's under the hood: Investigating Automatic Metrics on Meeting Summarization (2404.11124v2)

Published 17 Apr 2024 in cs.CL and cs.AI

Abstract: Meeting summarization has become a critical task considering the increase in online interactions. While new techniques are introduced regularly, their evaluation relies on metrics not designed to capture meeting-specific errors, undermining effective evaluation. This paper investigates what the frequently used automatic metrics capture and which errors they mask by correlating automatic metric scores with human evaluations across a broad error taxonomy. We commence with a comprehensive literature review on English meeting summarization to define key challenges, such as speaker dynamics and contextual turn-taking, and error types, such as missing information and linguistic inaccuracy, concepts previously loosely defined in the field. We examine the relationship between characteristic challenges and errors using annotated transcripts and summaries from Transformer-based sequence-to-sequence and autoregressive models on the general-summary portion of the QMSum dataset. Through experimental validation, we find that different model architectures respond variably to challenges in meeting transcripts, resulting in differently pronounced links between challenges and errors. The metrics currently used by default struggle to capture observable errors, showing weak to moderate correlations, while a third of the correlations show trends of error masking. Only a subset reacts accurately to specific errors, while most correlations show either unresponsiveness or failure to reflect the error's impact on summary quality.
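The core methodology described in the abstract is to score model-generated summaries with automatic metrics and correlate those scores with human annotations of meeting-specific errors. The snippet below is a minimal sketch of that idea, not the authors' code: the toy reference/candidate pairs, the "missing information" error counts, and the choice of the rouge_score and scipy libraries and of Spearman correlation are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): correlate an automatic
# metric with human error annotations for a small set of summaries.
# Data, error type, and column names are illustrative assumptions.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Toy reference/candidate summary pairs with a hypothetical human-annotated
# count of "missing information" errors per candidate summary.
data = [
    {"reference": "The team agreed to ship the remote with a rubber case.",
     "candidate": "The team agreed on a rubber case for the remote.",
     "missing_information": 0},
    {"reference": "Marketing asked for a launch date and a budget estimate.",
     "candidate": "Marketing asked for a launch date.",
     "missing_information": 1},
    {"reference": "The designer proposed speech control and an LCD screen.",
     "candidate": "The designer proposed an LCD screen only.",
     "missing_information": 2},
]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

# Compute ROUGE-L F1 for each candidate against its reference.
rouge_l = [scorer.score(d["reference"], d["candidate"])["rougeL"].fmeasure
           for d in data]
errors = [d["missing_information"] for d in data]

# A strongly negative correlation means the metric penalizes the error;
# a weak or positive one suggests the error is "masked" by the metric.
rho, p_value = spearmanr(rouge_l, errors)
print(f"Spearman rho (ROUGE-L vs. missing-information count): {rho:.2f} (p={p_value:.2f})")
```

In the paper's setting this analysis is repeated per metric and per error type from the taxonomy; a metric whose score does not drop as the annotated error count rises is the error-masking behavior the abstract refers to.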

References (57)
  1. Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than ROUGE? In Findings of the Association for Computational Linguistics: ACL 2022, pages 1547–1560, Dublin, Ireland. Association for Computational Linguistics.
  2. A Survey of Advanced Methods for Efficient Text Summarization. In 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), pages 0962–0968.
  3. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  4. J Beall. 2016. Best practices for scholarly authors in the age of predatory journals. The Annals of The Royal College of Surgeons of England, 98(2):77–79. PMID: 26829665.
  5. Longformer: The Long-Document Transformer.
  6. Prompted Opinion Summarization with GPT-3.5. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9282–9300, Toronto, Canada. Association for Computational Linguistics.
  7. Jiaao Chen and Diyi Yang. 2020a. Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization.
  8. Jiaao Chen and Diyi Yang. 2020b. Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4106–4118, Online. Association for Computational Linguistics.
  9. Arman Cohan and Nazli Goharian. 2016. Revisiting Summarization Evaluation for Scientific Articles. pages 806–813.
  10. SummEval: Re-evaluating Summarization Evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  11. Jody Condit Fagan. 2017. An evidence-based review of academic web search engines, 2014-2016: Implications for librarians’ practice and research agenda. Information Technology and Libraries, 36(2):7–47.
  12. A Survey on Dialogue Summarization: Recent Advances and New Frontiers.
  13. Mingqi Gao and Xiaojun Wan. 2022. DialSummEval: Revisiting Summarization Evaluation for Dialogues. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5693–5709, Seattle, United States. Association for Computational Linguistics.
  14. Sian Gooding. 2022. On the Ethical Considerations of Text Simplification. In Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022), pages 50–57, Dublin, Ireland. Association for Computational Linguistics.
  15. Who Says What to Whom: A Survey of Multi-Party Conversations. In Thirty-First International Joint Conference on Artificial Intelligence, volume 6, pages 5486–5493.
  16. Michael Hanna and Ondřej Bojar. 2021. A Fine-Grained Analysis of BERTScore. In Proceedings of the Sixth Conference on Machine Translation, pages 507–517, Online. Association for Computational Linguistics.
  17. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–882, Jeju Island, Korea. Association for Computational Linguistics.
  18. Efficient Long-Text Understanding with Short-Text Models.
  19. Meeting Summarization, A Challenge for Deep Learning. In Ignacio Rojas, Gonzalo Joya, and Andreu Catala, editors, Advances in Computational Intelligence, volume 11506, pages 644–655. Springer International Publishing, Cham.
  20. The ICSI meeting corpus. pages I–364.
  21. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12):248:1–248:38.
  22. A Bag of Tricks for Dialogue Summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8014–8022, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  23. How Domain Terminology Affects Meeting Summarization Performance. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5689–5695, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  24. Lakshmi Prasanna Kumar and Arman Kabiri. 2022. Meeting Summarization: A Survey of the State of the Art.
  25. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  26. Discourse Structure Extraction from Pre-Trained and Fine-Tuned Language Models in Dialogues. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2562–2579, Dubrovnik, Croatia. Association for Computational Linguistics.
  27. Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  28. BLANC: learning evaluation metrics for MT. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing - HLT ’05, pages 740–747, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
  29. Feifan Liu and Yang Liu. 2008. Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries. In Proceedings of ACL-08: HLT, Short Papers, pages 201–204, Columbus, Ohio. Association for Computational Linguistics.
  30. Coreference-Aware Dialogue Summarization. ArXiv:2106.08556 [cs].
  31. Multi-document summarization via deep learning techniques: A survey. ACM Comput. Surv., 55(5).
  32. LENS: A Learnable Evaluation Metric for Text Simplification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16383–16408, Toronto, Canada. Association for Computational Linguistics.
  33. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.
  34. The AMI meeting corpus.
  35. Do We Really Need Another Meeting? The Science of Workplace Meetings. Current Directions in Psychological Science, 27.
  36. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  37. Investigating Efficiently Extending Transformers for Long Input Summarization. ArXiv:2208.04347 [cs].
  38. The Trend in Using Online Meeting Applications for Learning During the Period of Pandemic COVID-19: A Literature Review. 1:58–68.
  39. Language Models are Unsupervised Multitask Learners.
  40. Abstractive Meeting Summarization: A Survey.
  41. Hadeel Saadany and Constantin Orasan. 2021. BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-Oriented Text. In Proceedings of the Translation and Interpreting Technology Online Conference, pages 48–56, Held Online. INCOMA Ltd.
  42. QuestEval: Summarization Asks for Fact-based Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  43. Automatic Minuting: A Pipeline Method for Generating Minutes from Multi-Party Meeting Proceedings. In Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, pages 691–702, Manila, Philippines. De La Salle University.
  44. CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5657–5668, Seattle, United States. Association for Computational Linguistics.
  45. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv:2307.09288 [cs].
  46. Analyzing and Evaluating Faithfulness in Dialogue Summarization.
  47. Analyzing and Evaluating Faithfulness in Dialogue Summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4897–4908, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  48. Plan and generate: Explicit and implicit variational augmentation for multi-document summarization of scientific articles. Information Processing & Management, 60(4):103409.
  49. Adjacency Pairs-Aware Hierarchical Attention Networks for Dialogue Intent Classification. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7622–7626. ISSN: 2379-190X.
  50. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, ICML’20, pages 11328–11339. JMLR.org.
  51. BERTScore: Evaluating Text Generation with BERT.
  52. MACSum: Controllable Summarization with Mixed Attributes. Transactions of the Association for Computational Linguistics, 11:787–803.
  53. An Exploratory Study on Long Dialogue Summarization: What Works and What’s Next.
  54. Zhuosheng Zhang and Hai Zhao. 2021. Advances in Multi-turn Dialogue Comprehension: A Survey.
  55. DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization.
  56. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5905–5921, Online. Association for Computational Linguistics.
  57. Towards Understanding Omission in Dialogue Summarization.
Authors (4)
  1. Frederic Kirstein (8 papers)
  2. Jan Philip Wahle (31 papers)
  3. Terry Ruas (46 papers)
  4. Bela Gipp (98 papers)
Citations (3)