
On Positional Bias of Faithfulness for Long-form Summarization (2410.23609v1)

Published 31 Oct 2024 in cs.CL

Abstract: LLMs often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs. We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias. To consistently evaluate faithfulness, we first compile a benchmark of eight human-annotated long-form summarization datasets and perform a meta-evaluation of faithfulness metrics. We show that LLM-based faithfulness metrics, though effective with full-context inputs, remain sensitive to document order, indicating positional bias. Analyzing LLM-generated summaries across six datasets, we find a "U-shaped" trend in faithfulness, where LLMs faithfully summarize the beginning and end of documents but neglect middle content. Perturbing document order similarly reveals models are less faithful when important documents are placed in the middle of the input. We find that this behavior is partly due to shifting focus with context length: as context increases, summaries become less faithful, but beyond a certain length, faithfulness improves as the model focuses on the end. Finally, we experiment with different generation techniques to reduce positional bias and find that prompting techniques effectively direct model attention to specific positions, whereas more sophisticated approaches offer limited improvements. Our data and code are available in https://github.com/meetdavidwan/longformfact.


Summary

  • The paper finds that LLMs display a U-shaped faithfulness pattern, faithfully summarizing the beginning and end of documents while neglecting middle content.
  • It introduces a comprehensive evaluation framework built on eight human-annotated datasets to quantify the impact of positional bias on summary faithfulness.
  • The study tests mitigation strategies, such as explicit prompts and hierarchical merging, and highlights the need for further research on adaptive attention mechanisms.

Positional Bias of Faithfulness in Long-form Summarization

The paper "On Positional Bias of Faithfulness for Long-form Summarization" examines positional bias in LLMs and its implications for faithfulness in long-context settings. This investigation is critical, given the widespread reliance on LLMs for generating long-form summaries, a task that demands attention to detail, comprehensive coverage, and factual consistency across lengthy documents.

Positional Bias in Long-form Summarization

LLMs have proven capable of producing high-quality summaries; however, their performance in long-form contexts is often compromised by positional bias. Specifically, LLMs tend to neglect the middle portions of their input, a manifestation of the "lost-in-the-middle" phenomenon. Through an empirical analysis of LLM-generated summaries across six datasets, the researchers identify a "U-shaped" faithfulness pattern: models attend most closely to the initial and final sections of documents and consequently omit important information from the middle. Such bias can lead to hallucinations, as models fabricate content instead of drawing on the overlooked sections, which poses a significant challenge for ensuring faithfulness.
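One way to surface this U-shaped pattern is to align each summary sentence to its most similar source sentence and histogram the relative source positions that the summary draws from. The sketch below illustrates this idea; it is not the authors' code, and `similarity` is a hypothetical stand-in (e.g., cosine similarity over sentence embeddings), so the exact alignment method in the paper may differ.

```python
from collections import Counter
from typing import Callable, List

def coverage_by_position(
    source_sents: List[str],
    summary_sents: List[str],
    similarity: Callable[[str, str], float],
    num_bins: int = 10,
) -> Counter:
    """Count, per relative-position bin, how often summary content maps to that bin."""
    bins = Counter()
    for s in summary_sents:
        # Index of the best-matching source sentence for this summary sentence.
        best = max(range(len(source_sents)), key=lambda i: similarity(s, source_sents[i]))
        rel = best / max(len(source_sents) - 1, 1)  # relative position in [0, 1]
        bins[min(int(rel * num_bins), num_bins - 1)] += 1
    # Low counts in the middle bins suggest neglected middle content.
    return bins
```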

Methodological Approach

The authors establish an evaluation framework to measure faithfulness consistently across long-form summarization tasks. This involves compiling a benchmark of eight human-annotated datasets and conducting a meta-evaluation of faithfulness metrics. They find that LLM-based metrics are effective when given full-context inputs but remain sensitive to changes in document order, underscoring an inherent positional bias in the evaluators themselves.
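The kind of meta-evaluation such a benchmark enables can be sketched as follows: for each dataset, compare a metric's scores against binary human faithfulness labels, here with balanced accuracy at a fixed threshold. The data shapes, field names, and 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

def balanced_accuracy(labels: List[int], preds: List[int]) -> float:
    """Mean of true-positive rate and true-negative rate for binary labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return (tpr + tnr) / 2

def meta_evaluate(
    benchmark: Dict[str, List[Tuple[float, int]]],  # dataset -> [(metric_score, human_label)]
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Balanced accuracy of a faithfulness metric against human labels, per dataset."""
    results = {}
    for name, examples in benchmark.items():
        scores, labels = zip(*examples)
        preds = [1 if s >= threshold else 0 for s in scores]
        results[name] = balanced_accuracy(list(labels), preds)
    return results
```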

A pivotal aspect of this research is the analysis of how positional bias affects summary fidelity across models and datasets. Perturbing document order confirms the effect: models are less faithful when important documents are placed in the middle of the input. The paper also ties this behavior to context length: as context grows, summaries initially become less faithful because the middle is neglected, but beyond a certain length faithfulness improves again as the model concentrates on the end of the input.
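The order-perturbation experiment can be sketched as below: move the key document to the start, middle, or end of a multi-document input, summarize, and compare faithfulness against that key document. This is a minimal illustration under assumed interfaces; `summarize` and `faithfulness_score` are hypothetical stand-ins for an LLM call and a faithfulness metric.

```python
from typing import Callable, Dict, List

def place_at(docs: List[str], key_idx: int, slot: str) -> List[str]:
    """Return a copy of `docs` with the key document moved to the given slot."""
    rest = [d for i, d in enumerate(docs) if i != key_idx]
    if slot == "start":
        return [docs[key_idx]] + rest
    if slot == "end":
        return rest + [docs[key_idx]]
    mid = len(rest) // 2  # "middle"
    return rest[:mid] + [docs[key_idx]] + rest[mid:]

def position_probe(
    docs: List[str],
    key_idx: int,
    summarize: Callable[[str], str],
    faithfulness_score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Faithfulness of the summary w.r.t. the key document for each placement."""
    results = {}
    for slot in ("start", "middle", "end"):
        source = "\n\n".join(place_at(docs, key_idx, slot))
        summary = summarize(source)
        # Score the summary only against the key document, so a drop in the
        # "middle" slot points to positional bias rather than content difficulty.
        results[slot] = faithfulness_score(summary, docs[key_idx])
    return results
```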

Mitigation Techniques

In exploring techniques to mitigate positional bias, the authors experiment with several generation methodologies. Simple modifications, such as explicit prompts to focus on certain document sections, show promise in guiding LLMs toward neglected parts. Conversely, more sophisticated strategies like hierarchical merging or incremental updates yield limited enhancements and often exacerbate faithfulness issues.
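A minimal sketch of the prompting intervention is shown below. The wording of the instruction is an assumption rather than the paper's exact prompt, and `call_llm` is a hypothetical wrapper around whatever chat or completion API is in use.

```python
from typing import Callable

FOCUS_PROMPT = (
    "Summarize the documents below. Make sure the summary also covers key facts "
    "from the MIDDLE portion of the input, not only the beginning and the end. "
    "Only include information that is stated in the documents.\n\n{documents}"
)

def summarize_with_focus(documents: str, call_llm: Callable[[str], str]) -> str:
    """Generate a summary with an explicit instruction to cover middle content."""
    return call_llm(FOCUS_PROMPT.format(documents=documents))
```

This mirrors the paper's finding that directing model attention to specific positions through prompting is effective, whereas heavier-weight generation strategies provide limited additional benefit.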

Implications and Future Directions

The exploration of positional biases opens pathways for refining LLMs to enhance their usability in academia, industry, and other domains requiring accurate long-context comprehension. In practical terms, addressing these biases could improve automated reporting systems, educational tools, and any technology relying on precise information extraction. Theoretically, this line of research can propel further studies into the structural and functional adaptations of neural architectures, potentially leading to more balanced attention mechanisms within LLMs.

For future undertakings, the authors suggest an extension of these methodologies to more extensive datasets and models, advocating for deeper investigation into adaptive strategies that can dynamically adjust attention allocation based on input complexity. They also stress the importance of developing more robust metrics that can better accommodate variations introduced by input rearrangement, aiming to bolster LLM faithfulness regardless of document layout.

In summary, the paper presents a thorough and methodical evaluation of how LLMs handle long-form summarization tasks, revealing critical limitations while proposing avenues for progress. As LLMs continue to evolve, addressing positional bias will be crucial in leveraging their full potential for nuanced and extensive text generation applications.
