
Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark (2402.14359v1)

Published 22 Feb 2024 in cs.CL

Abstract: The summarization capabilities of pretrained language models and LLMs have been widely validated in general domains, but their use on scientific corpora, which involve complex sentences and specialized knowledge, has been less assessed. This paper presents conceptual and experimental analyses of scientific summarization, highlighting the inadequacies of traditional evaluation methods, such as $n$-gram, embedding comparison, and QA, particularly in providing explanations, grasping scientific concepts, or identifying key content. Subsequently, we introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries based on different aspects. This facet-aware approach offers a thorough evaluation of abstracts by decomposing the evaluation task into simpler subtasks. Recognizing the absence of an evaluation benchmark in this domain, we curate a Facet-based scientific summarization Dataset (FD) with facet-level annotations. Our findings confirm that FM offers a more logical approach to evaluating scientific summaries. In addition, fine-tuned smaller models can compete with LLMs in scientific contexts, while LLMs have limitations in learning from in-context information in scientific domains. This suggests an area for future enhancement of LLMs.
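The abstract describes decomposing summary evaluation into per-facet matching subtasks judged by an LLM. Below is a minimal, hedged sketch of that idea: the facet names and the toy token-overlap judge are illustrative assumptions, standing in for the paper's actual facet schema and LLM-based semantic matching.

```python
# Illustrative sketch of facet-aware summary scoring. The facet set and the
# token_overlap judge are stand-ins, NOT the paper's actual definitions; the
# paper replaces the judge with LLM-based semantic matching.

FACETS = ("background", "method", "result", "conclusion")


def token_overlap(reference: str, candidate: str) -> float:
    """Toy judge: fraction of reference tokens that appear in the candidate.

    A real implementation would ask an LLM whether the two facet
    statements express the same content.
    """
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    return len(ref & cand) / len(ref) if ref else 0.0


def facet_score(reference_facets: dict, candidate_facets: dict,
                judge=token_overlap) -> float:
    """Average per-facet match over the facets present in the reference.

    Decomposes one hard "is this summary good?" question into one simpler
    matching subtask per facet, then aggregates the subtask scores.
    """
    scores = [judge(reference_facets[f], candidate_facets.get(f, ""))
              for f in FACETS if f in reference_facets]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, scoring a candidate whose "method" facet paraphrases the reference would yield a per-facet score between 0 and 1, and an identical summary scores 1.0 under this toy judge.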

Authors (8)
  1. Xiuying Chen (80 papers)
  2. Tairan Wang (4 papers)
  3. Qingqing Zhu (16 papers)
  4. Taicheng Guo (11 papers)
  5. Shen Gao (49 papers)
  6. Zhiyong Lu (113 papers)
  7. Xin Gao (209 papers)
  8. Xiangliang Zhang (131 papers)
Citations (1)
