Benchmarking LLMs on the Semantic Overlap Summarization Task
Abstract: Semantic Overlap Summarization (SOS) is a constrained multi-document summarization task, where the constraint is to capture the common/overlapping information between two alternative narratives. While recent advancements in LLMs have yielded superior performance on numerous summarization tasks, a benchmarking study of the SOS task using LLMs has yet to be performed. Because LLMs' responses are sensitive to slight variations in prompt design, a major challenge in conducting such a benchmarking study is systematically exploring a variety of prompts before drawing a reliable conclusion. Fortunately, the recently proposed TELeR taxonomy can be used to design and explore various prompts for LLMs. Using this TELeR taxonomy and 15 popular LLMs, this paper comprehensively evaluates LLMs on the SOS task, assessing their ability to summarize overlapping information from multiple alternative narratives. For evaluation, we report well-established metrics like ROUGE, BERTscore, and SEM-F1 on two different datasets of alternative narratives. We conclude the paper by analyzing the strengths and limitations of various LLMs in terms of their ability to capture overlapping information. The code and datasets used to conduct this study are available at https://anonymous.4open.science/r/llm_eval-E16D.
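The abstract names three evaluation metrics (ROUGE, BERTScore, and SEM-F1) for scoring model-generated overlap summaries against references. Below is a minimal sketch of how such lexical and semantic scores could be computed, assuming the `rouge-score`, `bert-score`, `sentence-transformers`, and `nltk` Python packages. The SEM-F1 function here is a simplified sentence-embedding approximation of the metric described in the cited SEM-F1 paper, not the authors' exact implementation, and the encoder name and example strings are illustrative assumptions.

```python
# Minimal sketch (assumptions): scoring one model-generated overlap summary
# against a reference summary with ROUGE, BERTScore, and a simplified
# SEM-F1-style sentence-embedding score.
# Requires: pip install rouge-score bert-score sentence-transformers nltk
import nltk
from nltk.tokenize import sent_tokenize
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

nltk.download("punkt", quiet=True)

def rouge(prediction: str, reference: str) -> dict:
    """Lexical n-gram overlap (ROUGE-1/2/L F-measures)."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    return {k: v.fmeasure for k, v in scorer.score(reference, prediction).items()}

def bertscore_f1(prediction: str, reference: str) -> float:
    """Token-level semantic similarity via contextual embeddings."""
    _, _, f1 = bert_score([prediction], [reference], lang="en", verbose=False)
    return f1.item()

def sem_f1_like(prediction: str, reference: str,
                encoder_name: str = "all-MiniLM-L6-v2") -> float:
    """Simplified SEM-F1-style score: sentence-level precision/recall from
    cosine similarity of sentence embeddings (an approximation, not the
    official SEM-F1 implementation)."""
    encoder = SentenceTransformer(encoder_name)
    pred_sents, ref_sents = sent_tokenize(prediction), sent_tokenize(reference)
    pred_emb = encoder.encode(pred_sents, convert_to_tensor=True)
    ref_emb = encoder.encode(ref_sents, convert_to_tensor=True)
    sims = util.cos_sim(pred_emb, ref_emb)             # |pred| x |ref| similarity matrix
    precision = sims.max(dim=1).values.mean().item()   # each prediction sentence vs. best reference sentence
    recall = sims.max(dim=0).values.mean().item()      # each reference sentence vs. best prediction sentence
    return 2 * precision * recall / (precision + recall + 1e-8)

if __name__ == "__main__":
    candidate = "Both narratives report that the storm caused flooding downtown."
    gold = "The two accounts agree that downtown streets were flooded by the storm."
    print(rouge(candidate, gold))
    print("BERTScore F1:", bertscore_f1(candidate, gold))
    print("SEM-F1-like:", sem_f1_like(candidate, gold))
```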
- Revisiting automatic evaluation of extractive summarization task: Can we do better than rouge? In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1547–1560. Association for Computational Linguistics.
- Palm 2 technical report. ArXiv:2305.10403 [cs].
- Summary level training of sentence rewriting for abstractive summarization. arXiv preprint arXiv:1909.08752.
- Learning to generate overlap summaries through noisy synthetic data. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11765–11777, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Sem-f1: an automatic way for semantic evaluation of multi-narrative overlap summaries at scale. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 780–792, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Semantic overlap summarization among multiple alternative narratives: An exploratory study. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6195–6207, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- What does it mean for a language model to preserve privacy? In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2280–2292.
- Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- Retrieve, rerank and rewrite: Soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 152–161, Melbourne, Australia. Association for Computational Linguistics.
- Universal sentence encoder for english. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.
- Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.
- Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.
- Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR.
- Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749.
- Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop: Automatic Summarization.
- Advancing candidate link generation for requirements tracing: the study of methods. IEEE Transactions on Software Engineering, 32(1):4–19.
- A unified model for extractive and abstractive summarization using inconsistency loss. arXiv preprint arXiv:1805.06266.
- Lora: Low-rank adaptation of large language models.
- Mistral 7b. ArXiv:2310.06825 [cs].
- Sofsat: Towards a setlike operator based framework for semantic analysis of text. ACM SIGKDD Explorations Newsletter, 20(2):21–30.
- Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pages 611–626, New York, NY, USA. Association for Computing Machinery.
- Adapting the neural encoder-decoder framework from single to multi-document summarization. arXiv preprint arXiv:1808.06218.
- Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Generative adversarial network for abstractive text summarization. arXiv preprint arXiv:1711.09357.
- Multi-document summarization via deep learning techniques: A survey. arXiv preprint arXiv:2011.04843.
- Survey on graph and cluster based approaches in multi-document text summarization. In International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2014), pages 1–5. IEEE.
- Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
- Ranking sentences for extractive summarization with reinforcement learning. arXiv preprint arXiv:1802.08636.
- OpenAI. 2023. Gpt-4 technical report. ArXiv:2303.08774 [cs].
- Training language models to follow instructions with human feedback. ArXiv:2203.02155 [cs].
- A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Better language models and their implications. OpenAI Blog, https://openai.com/blog/better-language-models.
- Language models are unsupervised multitask learners.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Prompts matter: Insights and strategies for prompt engineering in automated software traceability. In 2023 IEEE 31st International Requirements Engineering Conference Workshops (REW), pages 455–464.
- A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.
- Shubhra Kanti Karmaker Santu and Dongji Feng. 2023. Teler: A general taxonomy of llm prompts for benchmarking complex tasks. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14197–14203, Singapore. Association for Computational Linguistics.
- Exploring universal sentence encoders for zero-shot text classification. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 135–147, Online only. Association for Computational Linguistics.
- Zero-shot multi-label topic inference with sentence encoders and llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16218–16233, Singapore. Association for Computational Linguistics.
- Stanford Law School. 2023. Large language models as fiduciaries: A case study toward robustly communicating with artificial intelligence through legal standards.
- An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1):85–105.
- Compositional task representations for large language models. In The Eleventh International Conference on Learning Representations.
- Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990.
- MosaicML NLP Team. 2023. Introducing mpt-7b: A new standard for open-source, commercially usable llms. Accessed: 2024-01-30.
- Large language models in medicine. Nature Medicine, 29(8):1930–1940.
- Llama: Open and efficient foundation language models. ArXiv:2302.13971 [cs].
- Llama 2: Open foundation and fine-tuned chat models. ArXiv:2307.09288 [cs].
- SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272.
- The creation and analysis of a website privacy policy corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1330–1340, Berlin, Germany. Association for Computational Linguistics.
- Huggingface’s transformers: State-of-the-art natural language processing. ArXiv:1910.03771 [cs].
- Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. arXiv preprint arXiv:1804.07036.
- Ernie-gen: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. arXiv preprint arXiv:2001.11314.
- Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.
- Graph-based neural multi-document summarization. arXiv preprint arXiv:1706.06681.
- Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777.
- Summpip: Unsupervised multi-document summarization with sentence graph compression. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1949–1952.
- Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv:2306.05685 [cs].
- Extractive summarization as text matching. arXiv preprint arXiv:2004.08795.
- Searching for effective neural extractive summarization: What works and what’s next. arXiv preprint arXiv:1907.03491.
- Selective encoding for abstractive sentence summarization. arXiv preprint arXiv:1704.07073.
- Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.
- A robustly optimized bert pre-training approach with post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 1218–1227, Huhhot, China. Chinese Information Processing Society of China.