On Context Utilization in Summarization with Large Language Models (2310.10570v6)
Abstract: LLMs excel at abstractive summarization, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to long input contexts exceeding 100k tokens. However, in question answering, LLMs exhibit uneven utilization of their input context: they tend to favor the initial and final segments, resulting in a U-shaped performance pattern with respect to where the answer is located in the input. This bias raises concerns, particularly for summarization, where crucial content may be dispersed throughout the source document(s). Moreover, in summarization, mapping facts from the source to the summary is not trivial, as salient content is usually rephrased. In this paper, we conduct the first comprehensive study of context utilization and position bias in summarization. Our analysis encompasses 6 LLMs, 10 datasets, and 5 evaluation metrics. We introduce a new evaluation benchmark called MiddleSum, on which we benchmark two alternative inference methods to alleviate position bias: hierarchical summarization and incremental summarization. Our code and data can be found here: https://github.com/ntunlp/MiddleSum.
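The two alternative inference strategies named in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration rather than the MiddleSum implementation: `summarize_with_llm` is a hypothetical stand-in for an instruction-tuned LLM call (here it merely truncates its input so the script runs), and the word-based chunking and prompt wording are placeholders, not the paper's actual prompts.

```python
# Sketch of two long-input inference strategies: hierarchical summarization
# (summarize chunks independently, then summarize the chunk summaries) and
# incremental summarization (fold each chunk into a running summary).
from typing import List


def chunk_document(document: str, chunk_size: int = 2000) -> List[str]:
    """Split a long document into word-based chunks of roughly chunk_size words."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def summarize_with_llm(prompt: str, max_words: int = 120) -> str:
    """Hypothetical placeholder for an LLM call; here it simply truncates the prompt."""
    return " ".join(prompt.split()[:max_words])


def hierarchical_summarization(document: str) -> str:
    """Summarize each chunk independently, then summarize the concatenated chunk summaries."""
    chunk_summaries = [
        summarize_with_llm(f"Summarize the following text:\n{chunk}")
        for chunk in chunk_document(document)
    ]
    combined = "\n".join(chunk_summaries)
    return summarize_with_llm(f"Combine these partial summaries into one summary:\n{combined}")


def incremental_summarization(document: str) -> str:
    """Maintain a running summary that is updated after reading each new chunk."""
    summary = ""
    for chunk in chunk_document(document):
        prompt = (
            f"Current summary:\n{summary}\n\n"
            f"New text:\n{chunk}\n\n"
            "Update the summary so it also covers the new text."
        )
        summary = summarize_with_llm(prompt)
    return summary


if __name__ == "__main__":
    long_doc = "word " * 5000  # stand-in for a long source document
    print(hierarchical_summarization(long_doc)[:200])
    print(incremental_summarization(long_doc)[:200])
```

The design difference is that hierarchical summarization merges independently produced chunk summaries in a single final pass, whereas incremental summarization revisits the running summary at every step, so content from early and middle chunks repeatedly passes through the model.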
Authors: Mathieu Ravaut, Aixin Sun, Nancy F. Chen, Shafiq Joty