NarrativeXL: A Large-scale Dataset For Long-Term Memory Models (2305.13877v2)

Published 23 May 2023 in cs.CL and cs.AI

Abstract: We propose a new large-scale (nearly a million questions) ultra-long-context (more than 50,000 words average document length) reading comprehension dataset. Using GPT 3.5, we summarized each scene in 1,500 hand-curated fiction books from Project Gutenberg, which resulted in approximately 150 scene-level summaries per book. After that, we created a number of reading comprehension questions based on these summaries, including three types of multiple-choice scene recognition questions, as well as free-form narrative reconstruction questions. With 990,595 total questions, our dataset is an order of magnitude larger than the closest alternatives. Crucially, most questions have a known "retention demand", indicating how long-term of a memory is needed to answer them, which should aid long-term memory performance evaluation. We validate our data in four small-scale experiments: one with human labelers, and three with existing LLMs. We show that our questions 1) adequately represent the source material, 2) can be used to diagnose a model's memory capacity, and 3) are not trivial for modern LLMs even when the memory demand does not exceed those models' context lengths. Lastly, we provide our code, which can be used to further expand the dataset with minimal human labor.
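
A minimal Python sketch of the kind of question construction the abstract describes: packaging scene-level summaries into multiple-choice scene-recognition questions that carry an explicit "retention demand" (here measured, as an assumption, in words read since the queried scene ended). This is an illustrative reconstruction, not the authors' released pipeline; the Scene and RecognitionQuestion field names, the distractor-sampling strategy, and make_question itself are hypothetical.

# Illustrative sketch only -- not the NarrativeXL release code.
from dataclasses import dataclass
import random


@dataclass
class Scene:
    book_id: str
    index: int       # position of the scene within its book
    text: str        # original scene text
    summary: str     # scene summary (the paper uses GPT 3.5 for this step)


@dataclass
class RecognitionQuestion:
    book_id: str
    asked_at_scene: int    # reader has seen scenes 0 .. asked_at_scene
    options: list          # candidate summaries: one true, the rest distractors
    answer: int            # index of the true summary within options
    retention_demand: int  # words read since the queried scene ended


def make_question(book, queried, asked_at, distractor_pool, n_options=4, rng=None):
    """Build one scene-recognition question: which of these summaries
    describes a scene the reader has already encountered in this book?"""
    rng = rng or random.Random(0)
    assert 0 <= queried <= asked_at < len(book)
    true_summary = book[queried].summary
    options = rng.sample(distractor_pool, n_options - 1) + [true_summary]
    rng.shuffle(options)
    # Retention demand: how much text was read after the queried scene ended.
    demand = sum(len(s.text.split()) for s in book[queried + 1:asked_at + 1])
    return RecognitionQuestion(
        book_id=book[queried].book_id,
        asked_at_scene=asked_at,
        options=options,
        answer=options.index(true_summary),
        retention_demand=demand,
    )


# Toy usage: one three-scene book, distractor summaries drawn from other books.
book = [Scene("book-A", i, ("word " * 500).strip(), f"summary of scene {i}") for i in range(3)]
pool = [f"summary of an unrelated scene {i}" for i in range(10)]
q = make_question(book, queried=0, asked_at=2, distractor_pool=pool)
print(q.retention_demand, q.options[q.answer])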

Authors (2)
  1. Arseny Moskvichev (5 papers)
  2. Ky-Vinh Mai (1 paper)
