SKT5SciSumm -- Revisiting Extractive-Generative Approach for Multi-Document Scientific Summarization (2402.17311v2)
Abstract: Summarization of scientific text has shown significant benefits both for the research community and for human society. Because scientific text is distinctive in nature and the input to multi-document summarization is substantially long, the task requires effective embedding generation and text truncation that does not discard important information. To tackle these issues, we propose SKT5SciSumm, a hybrid framework for multi-document scientific summarization (MDSS). We leverage the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed Transformers (SPECTER) to encode and represent textual sentences, which enables efficient extractive summarization via k-means clustering. We then employ the T5 family of models to generate abstractive summaries from the extracted sentences. SKT5SciSumm achieves state-of-the-art performance on the Multi-XScience dataset. Through extensive experiments and evaluation, we show that our approach attains remarkable results with less complex models, highlighting its potential for advancing multi-document summarization of scientific text.
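As a rough illustration of the extract-then-generate pipeline the abstract describes, the sketch below encodes sentences with the Sentence-Transformer port of SPECTER, keeps one representative sentence per k-means cluster, and passes the extracted text to a T5 model for abstractive generation. This is a minimal sketch under stated assumptions: the checkpoint names (`sentence-transformers/allenai-specter`, `t5-base`), the cluster count, and the nearest-to-centroid selection rule are illustrative choices, not the authors' reported configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from transformers import T5ForConditionalGeneration, T5Tokenizer


def extract_then_summarize(sentences, num_clusters=8, max_summary_tokens=256):
    """Extract representative sentences by clustering, then summarize them with T5."""
    # 1) Encode each sentence with the Sentence-Transformer port of SPECTER
    #    (assumed checkpoint; the paper's exact encoder setup may differ).
    encoder = SentenceTransformer("sentence-transformers/allenai-specter")
    embeddings = encoder.encode(sentences)  # shape: (num_sentences, dim)

    # 2) Cluster the embeddings with k-means and keep the sentence nearest to each
    #    centroid (a simple centroid-based selection heuristic).
    k = min(num_clusters, len(sentences))
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    picked = []
    for c in range(k):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        picked.append(int(members[np.argmin(dists)]))
    extracted = [sentences[i] for i in sorted(picked)]  # preserve original order

    # 3) Generate an abstractive summary of the extracted sentences with T5.
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    inputs = tokenizer("summarize: " + " ".join(extracted),
                       return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, num_beams=4, max_length=max_summary_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A larger T5 variant or a tuned cluster count can be swapped in without changing the structure of the pipeline; the two stages stay decoupled, so the extractive step controls input length before generation.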
- Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
- Capturing relations between scientific papers: An abstractive model for related work section generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6068–6077, Online. Association for Computational Linguistics.
- Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers.
- Automatic text summarization: A comprehensive survey. Expert Systems with Applications, 165:113679.
- Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479.
- Proposition-level clustering for multi-document summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1765–1779, Seattle, United States. Association for Computational Linguistics.
- Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.
- Cong Duy Vu Hoang and Min-Yen Kan. 2010. Towards automated related work summarization. In Coling 2010: Posters, pages 427–435, Beijing, China. Coling 2010 Organizing Committee.
- Yue Hu and Xiaojun Wan. 2014. Automatic generation of related work sections in scientific papers: An optimization approach. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1624–1633, Doha, Qatar. Association for Computational Linguistics.
- Deconstructing human literature reviews – a framework for multi-document summarization. In Proceedings of the 14th European Workshop on Natural Language Generation, pages 125–135, Sofia, Bulgaria. Association for Computational Linguistics.
- Surveyor: A system for generating coherent survey articles for scientific topics. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1).
- Xin Jin and Jiawei Han. 2010. K-means clustering. In Encyclopedia of Machine Learning, pages 563–564. Springer US, Boston, MA.
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy. Association for Computational Linguistics.
- ClinicalT5: A generative language model for clinical text. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5436–5443, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Yao Lu, Yue Dong, and Laurent Charlin. 2020. Multi-XScience: A large-scale dataset for extreme multi-document summarization of scientific articles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8068–8074, Online. Association for Computational Linguistics.
- Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
- Using citations to generate surveys of scientific paradigms. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 584–592, Boulder, Colorado. Association for Computational Linguistics.
- Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919–938.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks.
- A thorough evaluation of task-specific pretraining for summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 140–145, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Improving multi-document summarization through referenced flexible extraction with credit-awareness. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1667–1681, Seattle, United States. Association for Computational Linguistics.
- Kaito Sugimoto and Akiko Aizawa. 2022. Incorporating the rhetoric of scientific language into sentence embeddings using phrase-guided distant supervision and metric learning. In Proceedings of the Third Workshop on Scholarly Document Processing, pages 54–68, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Xiaojun Wan and Jianwu Yang. 2006. Improved affinity graph based multi-document summarization. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 181–184, New York City, USA. Association for Computational Linguistics.
- Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, page 307–314, New York, NY, USA. Association for Computing Machinery.
- Multi-document scientific summarization from a knowledge graph-centric view. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6222–6233, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Neural related work summarization with a joint context-driven attention mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1776–1786, Brussels, Belgium. Association for Computational Linguistics.
- Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. 2022. PRIMERA: Pyramid-based masked sentence pre-training for multi-document summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5245–5263, Dublin, Ireland. Association for Computational Linguistics.
Authors:
- Huy Quoc To
- Hung-Nghiep Tran
- André Greiner-Petter
- Felix Beierle
- Akiko Aizawa
- Ming Liu
- Guangyan Huang