NarrativeXL: Ultra-Long Context QA Dataset
- NarrativeXL is a large-scale ultra-long-context reading comprehension dataset featuring nearly 1M questions from 1,500 curated fiction books.
- It employs detailed scene segmentation, automated summarization, and diverse question types to challenge models with retention demands averaging roughly 32,000 words.
- Experimental diagnostics reveal memory decay patterns in language models, informing curriculum finetuning and improvements in long-context architectures.
NarrativeXL is a large-scale, ultra-long-context reading comprehension dataset explicitly designed to evaluate and diagnose the long-term memory capabilities of LMs. Comprising nearly one million questions derived from over 1,500 hand-curated fiction books, NarrativeXL distinguishes itself by unprecedented document lengths (averaging over 50,000 words), scene-level granularity, fine-grained retention demand annotations, and a mix of multiple-choice and free-form tasks. The dataset supports both direct diagnostic evaluation and memory-centric pretraining of LMs, and all construction code is openly available for further expansion and reproducibility (Moskvichev et al., 2023).
1. Source Text Curation and Scene Summarization
NarrativeXL construction begins with manual and automated processing of the full Project Gutenberg fiction catalog. Only “single-connected” narratives are retained, with non-narrative entries (diaries, short-story collections, letters) removed, extraneous front/back matter stripped, and duplicates eliminated. The final corpus contains 1,500 highly curated books, each averaging ~160,000 words after cleaning.
Each book is divided into overlapping chunks of approximately 3,000 symbols (≈500 words, overlapping by 300 symbols), maximizing contextual integrity for scene segmentation. Each chunk is summarized using GPT-3.5 with a precise system/user prompt structure, yielding concise, context-faithful “scene summaries.” Per book, this process generates ~150 summaries, amounting to roughly 50,000 words per book post-summarization.
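The overlapping chunking step can be sketched as follows. This is a minimal illustration using the parameters described above (3,000-symbol chunks, 300-symbol overlap); the released pipeline's exact boundary handling may differ.

```python
def chunk_text(text, chunk_len=3000, overlap=300):
    """Split text into overlapping character chunks.

    Consecutive chunks share `overlap` characters so that a scene
    falling near a chunk edge appears intact in at least one chunk.
    """
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_len])
        if start + chunk_len >= len(text):
            break
    return chunks
```

A 3,000-symbol chunk corresponds to roughly 500 words of English text, which is why each ~160,000-word book yields on the order of 150 scene summaries.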
2. Automated Question and Task Generation
NarrativeXL incorporates four question types targeting both recognition and narrative reconstruction:
- Read-Along (Scene Recognition) Questions:
For each question, the LM receives as context a book prefix ending at chunk index $j$ and must answer whether a described scene, identified by its summary, occurred within this prefix. Each multiple-choice question includes one true option, "None of the above," and three distractor types:
  1. Lookahead: a true summary drawn from a later section of the same book.
  2. Other-book: a true summary from a different book, with character names replaced.
  3. Scene-distortion: a plausible but incorrect scene generated by GPT-3.5, preserving the original's style and setting.
- Free-form Scene Summary Reconstruction:
The LM is presented with a distorted scene summary and tasked with rewriting it to match the original narrative.
- Hierarchical Summary Reconstruction:
NarrativeXL employs recursive summarization to generate coarse-to-fine grained book abstractions. At each abstraction level, the LM reconstructs the correct summary from a distorted input.
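Assembling a Read-Along item from the pieces above can be sketched as follows. The dictionary layout and helper names are illustrative, not the released pipeline's API, and the exact option count in the published data may differ.

```python
import random

def build_read_along_question(summaries, i, other_book_summary,
                              distorted_summary, rng=None):
    """Assemble one multiple-choice Read-Along item for the scene
    summarized at index i, posed after that point in the book.

    Distractors follow the three types described above: lookahead
    (a true summary from later in the same book), other-book, and
    scene-distortion (model-generated).
    """
    rng = rng or random.Random(0)
    # Requires at least one later scene to draw a lookahead from.
    lookahead = summaries[rng.randrange(i + 1, len(summaries))]
    options = [summaries[i], lookahead, other_book_summary,
               distorted_summary, "None of the above"]
    rng.shuffle(options)
    return {"context_prefix_end": i,
            "options": options,
            "answer": options.index(summaries[i])}
```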
The resulting corpus statistics:
| Question Type | # Items | Avg. Doc Length (words) | Avg. Context Length | Avg. Retention Demand |
|---|---|---|---|---|
| Read-Along (multiple choice) | 726,803 | 54,334* | 54,334 | 31,931 |
| Scene summary reconstruction (free) | 244,111 | 87,051 | 87,051 | – |
| Hierarchical reconstruction (free) | 19,681 | 87,051 | 87,051 | – |
| Total | 990,595 | — | — | — |
*The "Avg. doc length" for Read-Along refers to the effective book length at question time.
3. Retention Demand and Diagnostic Memory Evaluation
A core innovation is the annotation of retention demand (RD): the amount of narrative memory the LM must retain to answer a question. For a Read-Along question whose correct answer refers to a scene at chunk index $i$, and where the question is posed at chunk index $j$ ($i \le j$), the retention demand is the number of words separating the referenced scene from the question:

$$RD = \sum_{k=i}^{j} w_k,$$

where $w_k$ denotes the word count of chunk $k$.
Across all Read-Along questions, the mean RD is 31,931 words (SD 36,597), with an interquartile range of 43,536 words. This granularity allows systematic dissection of a model's memory decay as a function of narrative distance, complementing raw context-window benchmarking.
This design enables plotting “forgetting curves” for LMs and benchmarking “lost in the middle” effects—where models increasingly fail to recall distant information despite token limits not being saturated.
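Under one natural reading of retention demand (RD counts the words between the referenced scene and the point where the question is posed), computing it per question is a one-liner over the per-chunk word counts:

```python
def retention_demand(chunk_word_counts, i, j):
    """Words the model must retain for a Read-Along question:
    the span from the referenced scene (chunk i) through the
    chunk at which the question is posed (chunk j, i <= j)."""
    assert 0 <= i <= j < len(chunk_word_counts)
    return sum(chunk_word_counts[i:j + 1])
```

With ~500-word chunks, a question about a scene 60 chunks back carries an RD of roughly 30,000 words, near the dataset mean.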
4. Evaluation Experiments and Metrics
Four small-scale validation experiments were conducted:
- Human Adequacy Check: Human annotators selected the correct scene summary from pairs (true vs. false) for 250 samples, achieving 0.95 accuracy (95% CI [0.92, 0.97]).
- BERT Baseline (Scene-Distortion): Fine-tuned BERT on answer options alone (ignoring context) achieved 0.524 accuracy (random baseline 0.167), indicating that distractors are nontrivial.
- Zero-shot Multiple-Choice (Short Context): On 60 Read-Along questions with low RD (≤8 scenes, ~4,000 words), Claude v1.3-100k scored 0.53 (95% CI [0.40, 0.66]) and GPT-4 scored 0.783 (95% CI [0.66, 0.88]).
- Summary Reconstruction: For free-form reconstruction, GPT-4 with full context achieved ROUGE-1 F1 of 0.576 and BertScore F1 of 0.913. The absence of trivial shortcuts was confirmed: models not provided the actual book performed substantially worse.
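The ROUGE-1 F1 used in the reconstruction experiment can be illustrated with a minimal unigram implementation; the reported scores presumably come from a standard package, and this sketch ignores stemming and stopword handling.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between whitespace-tokenized
    candidate and reference strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```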
Performance degrades as RD increases; e.g., Claude 2.0 achieves 0.51 accuracy in the “short” context regime (~4,077 words) and drops to 0.26 in “long” contexts (~50,977 words), closely mirroring contemporary understanding of context-length limitations in LMs.
5. Scaling, Automation, and Expansion
Dataset generation is fully automated after minimal manual curation. Book processing, scene summarization, distortion, hierarchical abstraction, question generation, and entity substitution are orchestrated with open-source code. Expanding the dataset to new books incurs minimal marginal human effort and a cost of approximately $0.30 per book.
Named-entity randomization mechanisms minimize contamination for popular titles, and best practices recommend verifying entity shuffling before model training or evaluation. Chunking or retrieval augmentation is recommended where computational budgets or model context windows are constrained.
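A minimal sketch of the name-substitution step, assuming a precomputed mapping from original character names to replacements (the actual pipeline's entity detection and mapping logic are not shown here):

```python
import re

def randomize_entities(text, name_map):
    """Replace character names with substitutes, e.g. to adapt an
    other-book distractor or to reduce memorization shortcuts for
    well-known titles. `name_map` maps original -> replacement."""
    # Whole-word, longest-first matching so that "Anna" is not
    # partially rewritten by a shorter key like "Ann".
    keys = sorted(name_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: name_map[m.group(0)], text)
```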
6. Usage Recommendations and Research Applications
NarrativeXL is designed for several experimental paradigms:
- Memory Evaluation:
LMs’ outputs are probed at various narrative positions, measuring accuracy against RD and generating granular forgetting curves.
- Curriculum Finetuning:
The retention demand annotation allows curriculum learning by incrementally increasing memory load.
- Long-Context Architecture Diagnostics:
Performance stratification by difficulty bins (RD ranges) enables precise identification of model capacity thresholds.
- Pretraining and Continual Learning:
The reconstruction tasks target narrative compression and memory extension for pretraining ultra-long-context models.
Researchers are cautioned to analyze scene-distortion distractors separately (due to a minor stylistic bias) and to monitor model regimes across difficulty bins (e.g., 0–5,000, 5,000–20,000, 20,000–50,000, 50,000+ words RD).
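Stratifying results into the RD bins suggested above yields the points of a forgetting curve. A minimal sketch, assuming per-question results are available as (retention demand, correct) pairs:

```python
def accuracy_by_rd_bin(results, edges=(0, 5_000, 20_000, 50_000)):
    """Group (retention_demand, correct) pairs into RD bins keyed
    by their lower edge and return per-bin accuracy. The final bin
    is open-ended (50,000+ words)."""
    bins = {}
    for rd, correct in results:
        lo = max(e for e in edges if e <= rd)
        n, k = bins.get(lo, (0, 0))
        bins[lo] = (n + 1, k + int(correct))
    return {lo: k / n for lo, (n, k) in sorted(bins.items())}
```

Plotting accuracy against the bin edges makes memory decay visible directly, rather than leaving it averaged into a single headline score.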
7. Implications and Limitations
NarrativeXL’s coverage, scale, and design make it a uniquely powerful resource for diagnosing long-term narrative retention and context-window bottlenecks in LMs (Moskvichev et al., 2023). Its nearly one-million-question scope and explicit RD annotation surpass prior narrative QA datasets. The observed negative correlations between RD and accuracy underscore the persistent challenge of ultra-long-context reasoning for modern LMs even within context limits—a key subject for ongoing research.
A plausible implication is that NarrativeXL can accelerate progress in ultra-long-context modeling, retrieval augmentation, and pragmatic memory-architecture design. The open pipeline allows rapid expansion and continued benchmarking as new LMs are developed.